Lab Assignment Five: Wide and Deep Network Architectures¶

  • Group: Lab One 3
    • Salissa Hernandez
    • Juan Carlos Dominguez
    • Leonardo Piedrahita
    • Brice Danvide

Wide and Deep Network Architectures combine the strengths of shallow models for memorization and deep models for generalization. The wide component captures feature interactions via cross-product embeddings, while the deep component uses multiple layers to learn complex, high-dimensional patterns. This architecture is particularly suited for datasets with heterogeneous features, combining categorical and numerical data effectively. In contrast, a Multi-Layer Perceptron (MLP) is a fully connected deep neural network that processes data without explicit feature crossing, relying entirely on its deep layers to learn feature interactions.

Three Wide and Deep architectures are designed and trained with varying crossed columns in the wide component and different numbers of layers in the deep branch, including one model with at least 10 layers, to investigate generalization performance. The analysis incorporates feature engineering techniques such as creating cross-product embeddings to enhance interactions between categorical features and normalizing numerical features for the deep component. Model evaluation is conducted using metrics like AUC and ROC curves, providing a detailed assessment of classification performance and decision boundaries. Stratified 10-fold cross-validation ensures robust evaluation, while dimensionality reduction techniques like Principal Component Analysis (PCA) visualize embedding separability. Key insights are drawn from cluster analysis, silhouette scores, and metric comparisons, providing recommendations for architectural improvements and dataset-specific optimizations. This analysis emphasizes clear assumptions, reproducibility, and comprehensive evaluation, serving as a complete, reproducible, and insightful study.

The dataset used is the following:

  • https://www.kaggle.com/datasets/mysarahmadbhat/mercedes-used-car-listing

It features detailed table data about used Mercedes-Benz cars, including categorical features such as model, fuel type, and transmission, as well as numerical features like mileage, engine size, and price. The dataset supports multi-class classification, with irrelevant features removed and categorical features one-hot encoded for the wide component and numerical features normalized for the deep component. Cross-product embeddings are created for selected features to enhance model performance by capturing feature interactions. Stratified 10-fold cross-validation is used for splitting the data, ensuring consistent class representation in each fold and providing a realistic mirroring of how the model would be used in practice. This dataset provides a diverse feature space, making it an ideal choice for analyzing the effectiveness of Wide and Deep architectures.

1. Preparation¶

1.1 Defining & Preparing Class Variables¶

In [ ]:
# Importing packages
import numpy as np
import pandas as pd
import missingno as mn
import warnings

# Suppress all warnings
warnings.filterwarnings("ignore")

# Scikit-Learn
from sklearn.preprocessing import LabelEncoder, StandardScaler, label_binarize
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from sklearn.metrics import roc_curve, auc, roc_auc_score, silhouette_score
from sklearn.decomposition import PCA
from scipy import stats
from scipy.stats import ttest_rel, wilcoxon

# Tensorflow Keras
import tensorflow as tf
from keras.models import Sequential, Model
from keras.layers import Dense, Input, concatenate
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping

# Visualizations
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
df = pd.read_csv('../../Data/merc.csv')
df.head(10)
Out[2]:
model year price transmission mileage fuelType tax mpg engineSize
0 SLK 2005 5200 Automatic 63000 Petrol 325 32.1 1.8
1 S Class 2017 34948 Automatic 27000 Hybrid 20 61.4 2.1
2 SL CLASS 2016 49948 Automatic 6200 Petrol 555 28.0 5.5
3 G Class 2016 61948 Automatic 16000 Petrol 325 30.4 4.0
4 G Class 2016 73948 Automatic 4000 Petrol 325 30.1 4.0
5 SL CLASS 2011 149948 Automatic 3000 Petrol 570 21.4 6.2
6 GLE Class 2018 30948 Automatic 16000 Diesel 145 47.9 2.1
7 S Class 2012 10948 Automatic 107000 Petrol 265 36.7 3.5
8 G Class 2019 139948 Automatic 12000 Petrol 145 21.4 4.0
9 GLA Class 2017 19750 Automatic 15258 Diesel 30 64.2 2.1
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13119 entries, 0 to 13118
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         13119 non-null  object 
 1   year          13119 non-null  int64  
 2   price         13119 non-null  int64  
 3   transmission  13119 non-null  object 
 4   mileage       13119 non-null  int64  
 5   fuelType      13119 non-null  object 
 6   tax           13119 non-null  int64  
 7   mpg           13119 non-null  float64
 8   engineSize    13119 non-null  float64
dtypes: float64(2), int64(4), object(3)
memory usage: 922.6+ KB
In [4]:
df.describe()
Out[4]:
year price mileage tax mpg engineSize
count 13119.000000 13119.000000 13119.000000 13119.000000 13119.000000 13119.000000
mean 2017.296288 24698.596920 21949.559037 129.972178 55.155843 2.071530
std 2.224709 11842.675542 21176.512267 65.260286 15.220082 0.572426
min 1970.000000 650.000000 1.000000 0.000000 1.100000 0.000000
25% 2016.000000 17450.000000 6097.500000 125.000000 45.600000 1.800000
50% 2018.000000 22480.000000 15189.000000 145.000000 56.500000 2.000000
75% 2019.000000 28980.000000 31779.500000 145.000000 64.200000 2.100000
max 2020.000000 159999.000000 259000.000000 580.000000 217.300000 6.200000
In [5]:
# Returns the dimensions of the dataframe as (number of rows, number of columns)
df.shape
Out[5]:
(13119, 9)
In [6]:
# Returns an index object containing the col labels of the dataframe
df.columns
Out[6]:
Index(['model', 'year', 'price', 'transmission', 'mileage', 'fuelType', 'tax',
       'mpg', 'engineSize'],
      dtype='object')
In [7]:
# Clean column names: make them lowercase and replace spaces with underscores
df.columns = df.columns.str.replace(r'(?<!^)(?=[A-Z])', '_', regex=True).str.lower()

# Check the updated column names
print(df.columns)
Index(['model', 'year', 'price', 'transmission', 'mileage', 'fuel_type', 'tax',
       'mpg', 'engine_size'],
      dtype='object')

Checking for Duplicate Values¶

In [8]:
# Checking for duplicates
duplicates_before = df.duplicated().sum()
print(f'Duplicates before dropping: {duplicates_before}')
Duplicates before dropping: 259
In [9]:
# Dropping duplicates
df.drop_duplicates(inplace=True)
In [10]:
# No more duplicates!
duplicates_after = df.duplicated().sum()
print(f'Duplicates after dropping: {duplicates_after}')
Duplicates after dropping: 0

Checking for Missing/Null Values¶

In [11]:
# Show missing data
mn.matrix(df)
Out[11]:
<Axes: >
No description has been provided for this image
In [12]:
# Checking for null values
df.isnull().sum()
Out[12]:
model           0
year            0
price           0
transmission    0
mileage         0
fuel_type       0
tax             0
mpg             0
engine_size     0
dtype: int64
In [13]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 12860 entries, 0 to 13118
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         12860 non-null  object 
 1   year          12860 non-null  int64  
 2   price         12860 non-null  int64  
 3   transmission  12860 non-null  object 
 4   mileage       12860 non-null  int64  
 5   fuel_type     12860 non-null  object 
 6   tax           12860 non-null  int64  
 7   mpg           12860 non-null  float64
 8   engine_size   12860 non-null  float64
dtypes: float64(2), int64(4), object(3)
memory usage: 1004.7+ KB

Checking for Outliers¶

In [14]:
# Checking For Outliers
df.describe()
Out[14]:
year price mileage tax mpg engine_size
count 12860.000000 12860.000000 12860.000000 12860.000000 12860.000000 12860.000000
mean 2017.267963 24636.426361 22169.588336 129.843701 55.197535 2.075381
std 2.226127 11874.220447 21077.039295 65.580514 15.181133 0.573434
min 1970.000000 650.000000 1.000000 0.000000 1.100000 0.000000
25% 2016.000000 17309.750000 6494.000000 125.000000 45.600000 1.800000
50% 2018.000000 22299.000000 15448.500000 145.000000 56.500000 2.000000
75% 2019.000000 28971.250000 32000.000000 145.000000 64.200000 2.100000
max 2020.000000 159999.000000 259000.000000 580.000000 217.300000 6.200000
In [15]:
# Defines upper and lower bounds for each column
df = df[
    (df['price'] >= 1000) & (df['price'] <= 60000) &         # Filter price between 1,000 and 60,000
    (df['mileage'] <= 150000) &                              # Filter mileage below 150,000
    (df['tax'] <= 300) &                                     # Filter tax below 300
    (df['mpg'] >= 10) & (df['mpg'] <= 100) &                 # Filter mpg between 10 and 100
    (df['engine_size'] > 0) & (df['engine_size'] <= 5)         # Filter engineSize between 0 and 5 liters
]
In [16]:
# Outliers Removed!
df.describe()
Out[16]:
year price mileage tax mpg engine_size
count 12351.000000 12351.000000 12351.000000 12351.000000 12351.000000 12351.000000
mean 2017.353494 23891.015707 21743.589507 126.171160 55.067776 2.027107
std 1.953895 9455.640104 19996.533334 54.209434 11.558749 0.463277
min 1997.000000 1350.000000 1.000000 0.000000 24.600000 1.300000
25% 2016.000000 17299.000000 6620.500000 125.000000 46.300000 1.600000
50% 2018.000000 22156.000000 15329.000000 145.000000 56.500000 2.000000
75% 2019.000000 28480.000000 31549.000000 145.000000 64.200000 2.100000
max 2020.000000 59999.000000 150000.000000 300.000000 80.700000 4.700000

Evaluation of Filtering Criteria¶

Objective: The goal of the filtering criteria is to eliminate outliers that could skew the analysis and predictive modeling of car prices based on various attributes, such as price, mileage, and engine size.

1. Price Filter:¶

  • Criteria: Price is filtered between €1,000 and €60,000.
  • Rationale:
    • Lower Bound: Setting a minimum price of €1,000 helps exclude listings that may be erroneous (e.g., missing data or extreme discounts).
    • Upper Bound: The maximum price of €60,000 is aimed at excluding luxury and exotic cars that may not represent the typical market for used Mercedes vehicles. The mean price post-filtering is €23,891, indicating that the filtered dataset contains more reasonably priced vehicles.

2. Mileage Filter:¶

  • Criteria: Mileage is capped at 150,000 km.
  • Rationale:
    • High mileage often indicates extensive use and potential wear, which could correlate negatively with price. By limiting mileage to a maximum of 150,000 km, the dataset now represents vehicles that are more commonly sold in the used car market, improving the relevance of the data for predictive modeling. The mean mileage remains within a practical range (21,743 km).

3. Tax Filter:¶

  • Criteria: Tax is limited to a maximum of €300.
  • Rationale:
    • This upper bound ensures that extremely high taxes, which might apply to specialty vehicles or those with high emissions, are excluded. The average tax remains reasonable at €126, supporting the filtering effectiveness.

4. MPG Filter:¶

  • Criteria: MPG is filtered between 10 and 100.
  • Rationale:
    • Setting a minimum of 10 MPG avoids extremely inefficient vehicles that may not be practical for buyers. The maximum of 100 MPG is a logical upper limit, as cars with exceptionally high MPG are often hybrids or very efficient models that may skew predictions. The mean MPG of 55.07 suggests that the dataset retains efficient vehicles.

5. Engine Size Filter:¶

  • Criteria: Engine size is limited to between 0 and 5 liters.
  • Rationale:
    • This range encompasses the vast majority of passenger vehicles while excluding high-performance or commercial vehicles that fall outside the typical used car market. The mean engine size of 2.03 liters is consistent with average passenger vehicles.

Conclusion¶

The filtering criteria employed appear to be effective in removing outliers and retaining a dataset that is representative of the used car market. The adjustments made through these criteria led to a more focused dataset, evidenced by reasonable means and ranges for each variable.

In [17]:
# Resetting the index
df = df.reset_index(drop=True)
In [18]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12351 entries, 0 to 12350
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         12351 non-null  object 
 1   year          12351 non-null  int64  
 2   price         12351 non-null  int64  
 3   transmission  12351 non-null  object 
 4   mileage       12351 non-null  int64  
 5   fuel_type     12351 non-null  object 
 6   tax           12351 non-null  int64  
 7   mpg           12351 non-null  float64
 8   engine_size   12351 non-null  float64
dtypes: float64(2), int64(4), object(3)
memory usage: 868.6+ KB

Visualizations for Categorical Attributes¶

Transmission¶

In [ ]:
# Sets a Seaborn style
sns.set(style="whitegrid")

# Defines colors
colors = ['#1E90FF', '#00CED1', '#20B2AA', '#3CB371', '#4682B4', '#5F9EA0', '#87CEEB', '#00BFFF']
transmission_counts = df.transmission.value_counts()

# Filters out categories with zero counts (if any)
transmission_counts = transmission_counts[transmission_counts > 0]

# Calculates percentages
percentages = 100 * transmission_counts / transmission_counts.sum()

# Creates labels with percentages, hiding those below 1%
labels = []
for label, pct in zip(transmission_counts.index, percentages):
    if pct < 1:
        labels.append("")  # Sets empty for small percentages
    else:
        labels.append(f"{label} ({pct:.1f}%)")

# Creates the figure and axis
fig, ax = plt.subplots(figsize=(10, 6))

# Creates the pie chart
wedges, texts = ax.pie(transmission_counts, 
                        labels=labels, 
                        startangle=90, 
                        colors=colors[:len(transmission_counts)],
                        wedgeprops=dict(edgecolor='black', alpha=0.9))

# Styles the text labels
for text in texts:
    text.set_fontsize(14)
    text.set_color('black')

# Sets the title
plt.title('Distribution of Transmission Type', fontsize=25, fontweight='bold', color='black', pad=20)

# Customizes the figure background color
fig.patch.set_facecolor('#f6f5f5')

# Displays the pie chart
plt.show()
No description has been provided for this image

Model¶

In [ ]:
# Sets a Seaborn style
sns.set(style="whitegrid")

# Gets counts for all models
model_counts = df.model.value_counts()
total_counts = model_counts.sum()

# Calculates percentages
percentages = (model_counts / total_counts) * 100

# Creates the figure and axis
fig, ax = plt.subplots(figsize=(12, 8))

# Determines colors: unique colors for the top three percentages, grey for the rest
colors = ['#1E90FF', '#00CED1', '#20B2AA']  # Distinct colors for the top three
grey_color = '#c4c4c4'  # Grey for the rest
bar_colors = [grey_color] * len(percentages)

# Gets indices of the top three models
top_three_indices = percentages.nlargest(3).index
for i in range(len(percentages)):
    if percentages.index[i] in top_three_indices:
        bar_colors[i] = colors.pop(0)  # Assigns a distinct color

# Creates vertical bars
bars = ax.bar(percentages.index, percentages.values, color=bar_colors, alpha=0.9, edgecolor='black')

# Adds annotations for the percentage labels on top of the bars
for bar in bars:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width() / 2, height + 1, f'{height:.1f}%', 
            ha='center', fontsize=10, fontweight='bold', color='black') 

# Sets the title
plt.title('Distribution of Car Models (Percentage)', fontsize=25, fontweight='bold', color='black', pad=20)

# Customizes the axes
ax.set_xlabel('Car Models', fontsize=14)
ax.set_ylabel('Percentage (%)', fontsize=14)

# Rotates x-tick labels to vertical for better alignment
plt.xticks(rotation=90, ha='center', fontsize=12)  # Sets rotation to 90 for vertical

# Customizes the figure background color
fig.patch.set_facecolor('#f6f5f5')
ax.set_facecolor('#f6f5f5')

# Adds gridlines for better readability
ax.yaxis.grid(True, which='both', linestyle='--', linewidth=0.7, color='gray')

# Hides the spines for a cleaner look
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.show()
No description has been provided for this image

Engine Size¶

In [ ]:
# Sets a Seaborn style
sns.set(style="whitegrid")

# Defines a cooler color palette
colors = ['#1E90FF', '#00CED1', '#20B2AA'] + ['#c4c4c4'] * 5  # Grey for the rest

# Gets counts for fuel types
fuel_counts = df.fuel_type.value_counts()

# Calculates percentages
fuel_percentages = (fuel_counts / fuel_counts.sum()) * 100

# Creates the figure and axis
fig, ax = plt.subplots(figsize=(10, 6))

# Creates vertical bars
bars = ax.bar(fuel_percentages.index, fuel_percentages.values, color=colors[:len(fuel_percentages)], alpha=0.9, edgecolor='black')

# Adds annotations for the percentage labels
for bar in bars:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width() / 2, height + 1, f'{height:.1f}%', 
            ha='center', va='bottom', fontsize=12, fontweight='bold', color='black')

# Sets the title
plt.title('Distribution of Fuel Types', fontsize=25, fontweight='bold', color='black', pad=20)

# Customizes the x and y axis
ax.set_ylabel('Percentage (%)', fontsize=14)
ax.set_xlabel('Fuel Type', fontsize=14)

# Customizes the figure background color
fig.patch.set_facecolor('#f6f5f5')
ax.set_facecolor('#f6f5f5')

# Adds gridlines for better readability
ax.yaxis.grid(True, which='both', linestyle='--', linewidth=0.7, color='gray')

# Hides the spines for a cleaner look
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Rotates the x labels to be vertical
plt.xticks(rotation=90)

plt.show()
No description has been provided for this image

Visualizations for Numerical Attributes¶

In [ ]:
# Sets up the figure
fig = plt.figure(figsize=(15, 6))
fig.patch.set_facecolor('#f5f6f6')

# Creates a grid for the subplots
gs = fig.add_gridspec(2, 3)
gs.update(wspace=0.2, hspace=0.2)

# Creates subplots
ax0 = fig.add_subplot(gs[0, 0])
ax1 = fig.add_subplot(gs[0, 1])
ax2 = fig.add_subplot(gs[0, 2])
ax3 = fig.add_subplot(gs[1, 0])
ax4 = fig.add_subplot(gs[1, 1])
ax5 = fig.add_subplot(gs[1, 2])

axes = [ax0, ax1, ax2, ax3, ax4, ax5]
for ax in axes:
    ax.set_facecolor('#f5f6f6')
    ax.tick_params(axis='x', labelsize=12, which='major', direction='out', pad=2, length=1.5)
    ax.tick_params(axis='y', colors='black')
    ax.axes.get_yaxis().set_visible(False)

    for loc in ['left', 'right', 'top', 'bottom']:
        ax.spines[loc].set_visible(False)

# Selects numerical columns
cols = df.select_dtypes(exclude='object').columns

# Plots KDE for each numerical attribute
sns.kdeplot(x=df[cols[0]], color="green", fill=True, ax=ax0)
sns.kdeplot(x=df[cols[1]], color="red", fill=True, ax=ax1)
sns.kdeplot(x=df[cols[2]], color="blue", fill=True, ax=ax2)
sns.kdeplot(x=df[cols[3]], color="black", fill=True, ax=ax3)
sns.kdeplot(x=df[cols[4]], color="pink", fill=True, ax=ax4)
sns.kdeplot(x=df[cols[5]], color="orange", fill=True, ax=ax5)

# Adds titles and texts
fig.text(0.2, 0.98, "KDE Visualizations on Numerical Attributes:", **{'font': 'serif', 'size': 18, 'weight': 'bold'}, alpha=1)

plt.show()
No description has been provided for this image

Encoding the Target Attribute: price¶

In [ ]:
# Defines bins and labels
bins = [0, 10000, 20000, 30000, 40000, 50000, df['price'].max()]
labels = ['Budget', 'Affordable', 'Mid-Range', 'High-End', 'Premium', 'Luxury']

# Uses pd.cut to bin the 'price' and assign categories with an explicit order
df['price'] = pd.cut(df['price'], bins=bins, labels=labels, include_lowest=True)

# Explicitly defines the order of the categories
ordered_labels = pd.Categorical(df['price'], categories=labels, ordered=True)

# Assigns the ordered categories back to the 'price' column
df['price'] = ordered_labels

# Now, manually encodes the categories as integers
df['price_encoded'] = df['price'].cat.codes

# Checks the unique values in the encoded 'price' column
print("Encoded 'price' values:")
print(df['price_encoded'].unique())

# Checks the mapping of the labels to the encoded values
price_mapping = dict(zip(df['price'].cat.categories, range(len(df['price'].cat.categories))))
print("\nPrice Category Encoding Mapping:", price_mapping)
Encoded 'price' values:
[3 1 2 5 0 4]

Price Category Encoding Mapping: {'Budget': 0, 'Affordable': 1, 'Mid-Range': 2, 'High-End': 3, 'Premium': 4, 'Luxury': 5}
In [ ]:
# Gets the counts of the encoded 'price' values
price_category_counts = df['price_encoded'].value_counts(normalize=True) * 100  # Normalize to get percentages

# Gets the labels corresponding to the numeric encoding
price_labels = df['price'].cat.categories  # Get the price categories

# Sorts the price_category_counts so it matches the order of price_labels
price_category_counts = price_category_counts.sort_index()  # Sort by index to match the category order

# Plots a bar chart
plt.figure(figsize=(10, 6))
price_category_counts.plot(kind='bar', color='skyblue', edgecolor='black')

# Adds labels and title
plt.title('Percentage Distribution of Price Categories', fontsize=18)
plt.xlabel('Price Category', fontsize=14)
plt.ylabel('Percentage (%)', fontsize=14)

# Sets the x-ticks to the correct category labels
plt.xticks(ticks=range(len(price_labels)), labels=price_labels, rotation=45)

# Shows percentage values on each bar
for index, value in enumerate(price_category_counts):
    plt.text(index, value + 0.5, f'{value:.1f}%', ha='center', va='bottom', fontsize=12, fontweight='bold')

# Displays the plot
plt.tight_layout()
plt.show()
No description has been provided for this image

Encoding Categorical Attributes¶

Model¶

In [ ]:
# Removes leading and trailing spaces from the 'model' column
df['model'] = df['model'].str.strip()

# Sorts the categories alphabetically
sorted_labels = sorted(df['model'].unique())

# Creates a Categorical type with sorted categories
df['model'] = pd.Categorical(df['model'], categories=sorted_labels, ordered=True)

# Encodes the 'model' column
df['model_encoded'] = df['model'].cat.codes

# Checks the mapping of the labels to the encoded values
model_mapping = dict(zip(df['model'].cat.categories, range(len(df['model'].cat.categories))))
print("\nModel Encoding Mapping:", model_mapping)
Model Encoding Mapping: {'180': 0, '200': 1, '220': 2, 'A Class': 3, 'B Class': 4, 'C Class': 5, 'CL Class': 6, 'CLA Class': 7, 'CLC Class': 8, 'CLK': 9, 'CLS Class': 10, 'E Class': 11, 'GL Class': 12, 'GLA Class': 13, 'GLB Class': 14, 'GLC Class': 15, 'GLE Class': 16, 'GLS Class': 17, 'M Class': 18, 'S Class': 19, 'SL CLASS': 20, 'SLK': 21, 'V Class': 22, 'X-CLASS': 23}

Transmission¶

In [ ]:
# Creates a Categorical type with the unique transmission values in the original order
df['transmission'] = pd.Categorical(df['transmission'], ordered=True)

# Encodes the 'transmission' column
df['transmission_encoded'] = df['transmission'].cat.codes

# Checks the unique encoded 'transmission' values
print("Encoded 'transmission' values:")
print(df['transmission_encoded'].unique())

# Checks the mapping of the labels to the encoded values
transmission_mapping = dict(zip(df['transmission'].cat.categories, range(len(df['transmission'].cat.categories))))
print("\nTransmission Encoding Mapping:", transmission_mapping)
Encoded 'transmission' values:
[0 1 3 2]

Transmission Encoding Mapping: {'Automatic': 0, 'Manual': 1, 'Other': 2, 'Semi-Auto': 3}

Fuel Type¶

In [ ]:
# Creates a Categorical type with the unique fuel types in the original order
df['fuel_type'] = pd.Categorical(df['fuel_type'], ordered=True)

# Encodes the 'fuel_type' column
df['fuel_type_encoded'] = df['fuel_type'].cat.codes

# Checks the mapping of the labels to the encoded values
fuel_type_mapping = dict(zip(df['fuel_type'].cat.categories, range(len(df['fuel_type'].cat.categories))))
print("\nFuel Type Encoding Mapping:", fuel_type_mapping)
Fuel Type Encoding Mapping: {'Diesel': 0, 'Hybrid': 1, 'Other': 2, 'Petrol': 3}

Encoding Numerical Attributes¶

MPG¶

In [ ]:
# Checks the range of 'mpg' values
print("Minimum mpg value:", df['mpg'].min())
print("Maximum mpg value:", df['mpg'].max())

# Defines new bin edges that cover the entire range of 'mpg' values
bins = [0, 25, 35, 45, 55, 65, 75, 85]  # Adjust these based on the actual range
labels = ['Very Low', 'Low', 'Medium', 'High', 'Very High', 'Excellent', 'Top Tier']

# Creates a new column in the DataFrame for the binned mpg values
df['mpg_binned'] = pd.cut(df['mpg'], bins=bins, labels=labels, right=False)

# Checks the distribution after binning
print(df['mpg_binned'].value_counts(dropna=False))

# Defines ordered categories and encode them
df['mpg_binned'] = pd.Categorical(df['mpg_binned'], categories=labels, ordered=True)
df['mpg_encoded'] = df['mpg_binned'].cat.codes  # -1 will appear if there are values outside the bins

# Checks the unique encoded 'mpg' values and their mapping
mpg_mapping = dict(zip(df['mpg_binned'].cat.categories, range(len(df['mpg_binned'].cat.categories))))
print("\nMPG Encoding Mapping:", mpg_mapping)

# Displays the encoded values distribution
print(df['mpg_encoded'].value_counts())
Minimum mpg value: 24.6
Maximum mpg value: 80.7
mpg_binned
Very High    3846
High         2953
Excellent    2869
Medium       1924
Low           674
Top Tier       83
Very Low        2
Name: count, dtype: int64

MPG Encoding Mapping: {'Very Low': 0, 'Low': 1, 'Medium': 2, 'High': 3, 'Very High': 4, 'Excellent': 5, 'Top Tier': 6}
mpg_encoded
4    3846
3    2953
5    2869
2    1924
1     674
6      83
0       2
Name: count, dtype: int64

Year¶

In [ ]:
# Defines bins and labels for decades
year_bins = [1990, 2000, 2010, 2020, 2030]  # Adjusted to cover the full range of years
year_labels = ['1990s', '2000s', '2010s', '2020s']

# Creates a new column in the DataFrame for the binned year values
df['year_binned'] = pd.cut(df['year'], bins=year_bins, labels=year_labels, right=False)

# Checks the distribution after binning
print(df['year_binned'].value_counts(dropna=False))

# Defines ordered categories and encode them
df['year_binned'] = pd.Categorical(df['year_binned'], categories=year_labels, ordered=True)
df['year_encoded'] = df['year_binned'].cat.codes  # -1 will appear if there are values outside the bins

# Checks the unique encoded year values and their mapping
year_mapping = dict(zip(df['year_binned'].cat.categories, range(len(df['year_binned'].cat.categories))))
print("\nYear Encoding Mapping:", year_mapping)

# Displays the encoded values distribution
print(df['year_encoded'].value_counts())
year_binned
2010s    11717
2020s      586
2000s       43
1990s        5
Name: count, dtype: int64

Year Encoding Mapping: {'1990s': 0, '2000s': 1, '2010s': 2, '2020s': 3}
year_encoded
2    11717
3      586
1       43
0        5
Name: count, dtype: int64

Engine Size¶

In [ ]:
# Scaling Engine Size
scaler = StandardScaler()
df['engine_size_scaled'] = scaler.fit_transform(df[['engine_size']])

Tax¶

In [ ]:
# Adjusts bin edges and labels to ensure coverage of all values
tax_bins = [0, 50, 100, 150, 250, 301] 
tax_labels = ['Very Low', 'Low', 'Medium', 'High', 'Very High']

# Apply binning with adjusted bins
df['tax_binned'] = pd.cut(df['tax'], bins=tax_bins, labels=tax_labels, right=False)

# Re-encodes the binned values
df['tax_binned'] = pd.Categorical(df['tax_binned'], categories=tax_labels, ordered=True)
df['tax_encoded'] = df['tax_binned'].cat.codes

# Checks the unique encoded tax values and their mapping
tax_mapping = dict(zip(df['tax_binned'].cat.categories, range(len(df['tax_binned'].cat.categories))))
print("\ntax Encoding Mapping:", tax_mapping)

# Checks the distribution of 'tax_encoded' after binning and encoding
print(df['tax_encoded'].value_counts())
tax Encoding Mapping: {'Very Low': 0, 'Low': 1, 'Medium': 2, 'High': 3, 'Very High': 4}
tax_encoded
2    8300
0    2298
3    1499
4     254
Name: count, dtype: int64

Mileage¶

In [ ]:
# Scaling Mileage
scaler = StandardScaler()
df['mileage_scaled'] = scaler.fit_transform(df[['mileage']])
In [33]:
# Dropping Attributes that were Encoded
df = df.drop(columns=['price', 'model', 'transmission', 'fuel_type', 'mpg', 'mpg_binned', 'year', 'year_binned', 'engine_size','tax','tax_binned', 'mileage'])

Preprocessing Summary:¶

  1. Original Data:

    • The original dataset contains a mix of categorical and numerical columns, including: model, year, price, transmission, mileage, fuelType, tax, mpg, and engineSize.
    • Categorical columns: model, transmission, fuelType.
    • Numerical columns: price, mileage, tax, mpg, engineSize, year.
  2. Transformations and Adjustments:

    • Encoding Categorical Variables:
      • The model column was encoded using integer labels representing different car models.
      • The transmission column was encoded as 0 for automatic and 1 for manual transmissions.
      • The fuelType column was encoded with integers for different types of fuel, such as Petrol, Diesel, and Hybrid.
      • The mpg (miles per gallon) values were binned into categories like "Very Low", "Low", and so on, and subsequently encoded as integers for compatibility with machine learning models.
    • Handling Numerical Features:
      • Numerical features like price, year, mileage, tax, and engineSize were either binned or scaled for better use in modeling.
      • Binning: The year column was grouped into decades (e.g., 2010-2019) and encoded as numbers.
      • Scaling: Standard scaling was applied to engineSize and mileage due to their wide range of values.
      • Encoding Tax: The tax column was grouped into categories (e.g., "Very Low", "Low"), then encoded into numerical values.
  3. Scaled Values:

    • For the engine_size_scaled feature, standard scaling was applied to engineSize so that it has a mean of 0 and a standard deviation of 1.
    • Similarly, mileage was scaled to ensure it is on the same scale as other numerical features, improving compatibility with the machine learning models.
  4. Encoded Variables:

    • The price_encoded variable was created by encoding the price values into different ranges, such as "Low", "Medium", and "High".
    • The categorical columns (model_encoded, transmission_encoded, fuel_type_encoded, mpg_encoded, tax_encoded) were all encoded into numerical values for model input.

Preprocessed Dataframe¶

In [34]:
df_preprocessed = df.copy()
In [35]:
df_preprocessed.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12351 entries, 0 to 12350
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   price_encoded         12351 non-null  int8   
 1   model_encoded         12351 non-null  int8   
 2   transmission_encoded  12351 non-null  int8   
 3   fuel_type_encoded     12351 non-null  int8   
 4   mpg_encoded           12351 non-null  int8   
 5   year_encoded          12351 non-null  int8   
 6   engine_size_scaled    12351 non-null  float64
 7   tax_encoded           12351 non-null  int8   
 8   mileage_scaled        12351 non-null  float64
dtypes: float64(2), int8(7)
memory usage: 277.5 KB
In [36]:
df_preprocessed.head(10)
Out[36]:
price_encoded model_encoded transmission_encoded fuel_type_encoded mpg_encoded year_encoded engine_size_scaled tax_encoded mileage_scaled
0 3 19 0 1 4 2 0.157348 0 0.262877
1 3 16 0 0 3 2 0.157348 2 -0.287241
2 1 19 0 3 2 2 3.179422 4 4.263732
3 1 13 0 0 4 2 0.157348 0 -0.324349
4 3 3 0 3 2 3 -0.058514 2 -1.057105
5 2 3 0 0 4 3 -1.137826 2 -1.037401
6 2 4 0 0 4 2 -0.058514 2 -1.073509
7 1 13 0 0 4 2 0.157348 2 0.998684
8 1 4 0 0 5 2 -1.137826 3 0.154904
9 1 5 0 0 4 2 0.157348 3 0.426962

Final Dataset for Classification:¶

Data Preprocessing:¶

The dataset initially contained both categorical and numerical variables. The relevant features were identified, including price, model, year, transmission, mileage, fuelType, tax, mpg, and engineSize. Missing values were handled as necessary, and the variables were cleaned for consistency.

Feature Engineering:¶

  • Target Variable Transformation: The price variable was transformed into a categorical variable (price_encoded) through binning, grouping the prices into discrete categories (e.g., Very Low, Low, Medium, High). This transformation was done to convert the problem from regression to classification, as the goal is now to predict which price category a given car will fall into based on the other features.
  • Categorical Encoding: Several categorical variables were encoded into numerical values for easier use in the machine learning model. transmission and fuelType were label-encoded, and model was also label-encoded to represent the car models numerically.
  • Numerical Transformation: Variables such as engineSize and mileage were scaled using standard scaling to ensure all numerical features were on the same scale. Additionally, mileage was transformed into a new scaled variable.
  • Binning of Year: The year column was binned by decades (e.g., 2000s, 2010s) to group the data into more manageable categories, reducing the influence of specific years on the model.

Dimensionality Reduction/Feature Selection:¶

  • Binning of Numerical Variables: Initially, there was consideration to bin numerical variables like mileage, but due to the skewed distribution, scaling was applied instead.

Final Dataset:¶

The final dataset, prepared for classification, contains transformed features such as price_encoded, model_encoded, transmission_encoded, fuel_type_encoded, and year_encoded. Newly created variables like price_encoded and mileage_scaled were included to provide the model with relevant information for classification. The dataset is now structured with categorical and numerical variables, all in the correct format for use in a classification model.

Classification Approach:¶

Given that the target variable price has been transformed into discrete categories (e.g., Very Low, Low, Medium, High), the task is now a classification problem, rather than a regression. The goal is to predict which price category a car will fall into based on its features. Feature selection and preprocessing steps ensure that all variables are in the right format and scale for optimal model performance. The model will be trained to classify a given car into one of the price categories based on its characteristics, which include model, transmission type, fuel type, mileage, and others. Overall, the preprocessing steps ensured that all variables were appropriately encoded, scaled, or transformed to provide the model with clean and structured data, ready for building and training a classification model.

1.2 Identifying Groups for Cross-Product Features¶

Proposed Cross-Product Features and Justification:¶

  1. fuel_type_encoded × mpg_encoded

    • Justification: The relationship between fuel type and MPG can provide useful insights into how different fuel types (e.g., Diesel, Petrol) correlate with fuel efficiency. For example, Diesel and Hybrid cars tend to have higher MPG compared to Petrol cars. By crossing these features, we capture the potential interactions between fuel type and MPG that might not be apparent when these features are treated separately.
    • Mapped Values:
      • Fuel Type Encoding Mapping:
        • Diesel: 0
        • Hybrid: 1
        • Other: 2
        • Petrol: 3
      • MPG Encoding Mapping:
        • Very Low: 0
        • Low: 1
        • Medium: 2
        • High: 3
        • Very High: 4
        • Excellent: 5
        • Top Tier: 6
    • This cross-product feature can help in identifying patterns of high MPG for specific fuel types (e.g., Petrol cars with Excellent or Very High MPG ratings).
  2. transmission_encoded × engine_size_scaled

    • Justification: The interaction between transmission type and engine size can be a significant factor in determining the performance and efficiency of a car. Automatic and Semi-Auto transmissions tend to be paired with larger engine sizes, while Manual transmissions might often be associated with smaller engines. Cross-encoding these features will allow the model to better understand how engine size influences transmission type.
    • Mapped Values:
      • Transmission Encoding Mapping:
        • Automatic: 0
        • Manual: 1
        • Other: 2
        • Semi-Auto: 3
      • Engine Size Scaling: The scaled engine_size helps the model understand the size relative to the dataset, so crossing this with the transmission type can reveal trends in car configurations (e.g., larger engines tend to be automatic).
    • This interaction captures nuances in how engine size and transmission type work together in shaping the vehicle's overall performance and efficiency.
  3. year_encoded × mileage_scaled

    • Justification: The age of the car (represented by the year) and its mileage are often related. Older cars typically have higher mileage, and understanding this relationship could provide insights into car depreciation or potential maintenance needs. By crossing these features, we capture the interaction between the car's age and its condition (in terms of mileage), which could influence its pricing and desirability.
    • Mapped Values:
      • Year Encoding Mapping:
        • 1990s: 0
        • 2000s: 1
        • 2010s: 2
        • 2020s: 3
      • Mileage Scaling: Mileage is scaled to understand its relative effect on the overall condition of the vehicle, with lower mileage indicating better condition. Crossing this with the year of manufacture allows the model to better grasp how mileage patterns change over time.
    • This cross-product can reveal trends, like how high mileage negatively affects older cars or how newer cars with high mileage might still be considered in good condition.
  4. model_encoded × year_encoded

    • Justification: Different car models tend to have different lifespans, and older models often have different features or designs compared to newer ones. By crossing model type with year, we can capture how the car's model influences its age-related features, such as depreciation or technological advancements.
    • Mapped Values:
      • Model Encoding Mapping: Each model is mapped to an integer (e.g., 180 to 0, 200 to 1, etc.), which helps the model distinguish between different vehicle models.
      • Year Encoding Mapping: The model’s release year can interact with the specific features of that model, highlighting how older models perform or are valued differently than newer models like the 1990s vehicles.
    • This interaction could reveal interesting patterns, such as newer models (e.g., GLA Class or S Class) holding their value better than older models like the 1990s vehicles.

Why the Target Variable Should Not Be Included:¶

  • Target Variable (e.g., price_encoded): The target variable in a classification or regression task represents the output or prediction that the model aims to predict. It should not be included in the cross-product features because:
    • Leakage of Information: Including the target variable in the feature set would introduce data leakage, where the model already knows the outcome while training, leading to an unrealistic and over-optimistic evaluation of its performance.
    • Redundancy: The target variable is what the model is trying to predict, so it should not be part of the input features. Including it would make the problem trivial and invalidate the prediction task.
    • Model Integrity: The objective is for the model to learn meaningful relationships between the features and the target variable. Including the target in the feature set would undermine this learning process by providing direct access to the target during model training.

Conclusion:¶

The proposed cross-product features are meaningful because they combine variables that have logical interactions in the context of the dataset. These interactions could reveal complex patterns that would be missed if the features were used separately. Additionally, the encoded values ensure that the categorical features are handled in a way that captures the relationship between them, while the scaling of continuous features (like engine_size and mileage) ensures that their values are appropriately accounted for in the cross-products. However, the target variable should not be included as a feature to prevent data leakage and maintain the integrity of the prediction task.

In [37]:
# Cross Columns
cross_cols = [['fuel_type_encoded', 'mpg_encoded'],
              ['transmission_encoded', 'engine_size_scaled'],
              ['year_encoded', 'mileage_scaled'],
              ['model_encoded', 'year_encoded']]

cross_col_names = []
for cols_list in cross_cols:
    enc = LabelEncoder()
    
    X_crossed = df_preprocessed[cols_list].astype(str).apply(lambda x: '_'.join(x), axis=1)
    cross_col_name = '_'.join(cols_list)
    enc.fit(X_crossed)
    df_preprocessed[cross_col_name] = enc.transform(X_crossed)
    cross_col_names.append(cross_col_name) 
    
cross_col_names
Out[37]:
['fuel_type_encoded_mpg_encoded',
 'transmission_encoded_engine_size_scaled',
 'year_encoded_mileage_scaled',
 'model_encoded_year_encoded']

1.3 Metrics for Evaluating Algorithm Performance¶

For evaluating the performance of the classification model on the price_encoded target variable, the chosen metrics are F1 Score, Precision, and Recall. These metrics align with the business objectives and address the needs of various stakeholders. Each metric provides a different perspective on model performance, ensuring that the model accurately classifies vehicles into price categories in a way that is balanced and relevant to operational needs.


F1 Score¶

The F1 Score is a metric that balances Precision and Recall, providing an overall measure of the model’s performance across different price categories. This metric is especially important for stakeholders such as sales and marketing teams, who rely on accurate vehicle segmentation to effectively target specific customer segments.

  • Business Relevance: Sales and marketing teams need an accurate breakdown of vehicle categories—such as budget, mid-range, and premium—to tailor promotions and strategies accordingly. A high F1 Score, ideally above 0.75, would indicate that the model can effectively differentiate across all price bins, minimizing risks associated with mis-targeting. An F1 Score close to 0.80 or higher would be particularly useful, as it shows the model is well-balanced and can identify various segments without favoring one too heavily.

  • Impact: With a balanced F1 Score, no customer segment is disproportionately ignored. This metric ensures efficient resource allocation across different categories, improving the reach and engagement of marketing campaigns.


Precision¶

Precision is essential for evaluating the model’s accuracy in predicting specific price categories. High Precision helps avoid false positives, particularly for high-value categories, which is critical for stakeholders such as inventory management and customer relations.

  • Business Relevance: Inventory and customer relations teams need the model to accurately identify high-value categories to ensure that customers are not misled by incorrect classifications of vehicles as premium when they are not. For premium bins, Precision should ideally be above 0.85 to avoid classifying lower-cost vehicles as high-value. For budget categories, a Precision score of 0.75 is acceptable, as minor overlaps may be tolerable due to higher demand.

  • Impact: Strong Precision (above 0.85 for high-value bins) builds customer trust, as it assures that vehicles advertised as premium meet expectations. Additionally, by correctly classifying these premium vehicles, the organization can allocate them to the appropriate customer segments, reducing resource misallocation.


Recall¶

Recall measures the model’s effectiveness in capturing all relevant instances within each price category, ensuring comprehensive coverage of each price range. This is valuable to market analysis and inventory planning teams, as it helps avoid missing any vehicles within high-demand segments.

  • Business Relevance: Market analysts and inventory planners benefit from high Recall because it enables accurate demand forecasting and better inventory management. For budget bins, Recall should ideally be above 0.80 to ensure the model captures a complete view of affordable options. For luxury bins, where segments are often smaller, a slightly lower Recall of 0.75 is acceptable.

  • Impact: High Recall across categories ensures full market visibility, helping analysts make confident assessments of demand across price segments. For inventory management, high Recall ensures that the inventory aligns well with demand across all categories, reducing the risk of stock imbalances.


Summary of Metrics and Stakeholder Impact¶

Each metric was chosen to align with business needs and maximize operational efficiency:

  • F1 Score provides a balanced measure, helping sales and marketing reach the right audience segments with fewer misclassifications.
  • Precision minimizes costly errors in premium categories, enhancing customer satisfaction and resource allocation.
  • Recall supports complete market visibility and demand forecasting, critical for market analysis and inventory planning.

Together, these metrics provide a well-rounded evaluation of model performance, ensuring that the classification of price categories supports business objectives across multiple functional areas. By meeting each metric’s threshold, the model can drive data-driven decisions, improving customer engagement and operational accuracy.

1.4 Dividing Data into Training & Testing¶

Method for Dividing Data: Stratified 10-Fold Cross-Validation¶

For dividing the data into training and testing, Stratified 10-fold cross-validation will be used. This method was selected due to several reasons that align with the nature of the dataset and the task at hand.

Choice of Method: Stratified 10-Fold Cross-Validation¶

Stratified 10-fold cross-validation ensures that the data is split into 10 equal parts, with each fold maintaining the same distribution of the target variable, price_encoded, as in the entire dataset. This is particularly important because the target variable consists of multiple price categories (bins), which could potentially be imbalanced. Some price categories might have more data points than others, and using stratified splits ensures that each fold has a proportional representation of each class. This way, each fold accurately represents the overall distribution of the target, preventing bias that could arise from skewed distributions in certain folds.

Why Stratified 10-Fold Cross-Validation Is Appropriate¶

  1. Handling Imbalanced Classes:
    The target variable, price_encoded, consists of different price categories, which might not have an even distribution of instances. Some price bins may be overrepresented (e.g., a popular mid-range price category), while others may have very few instances. In such cases, a standard cross-validation method could lead to some folds having few or no examples of certain price bins, resulting in biased or inaccurate model performance. Stratified cross-validation addresses this by ensuring that each fold has a similar proportion of each category, making the evaluation of the model’s performance more reliable across all price bins.

  2. More Reliable Performance Metrics:
    Using Stratified 10-fold cross-validation provides a more comprehensive and reliable evaluation of the model. Since each fold is tested on a different subset of the data, the model’s performance metrics, such as F1 score, precision, and recall, are averaged over multiple folds. This reduces the impact of any one random split that may be unrepresentative of the overall data. It also helps account for variability in the model's performance, leading to a more robust estimate of how well the model generalizes to new, unseen data.

  3. Mirroring Real-World Use:
    In practice, machine learning models are deployed to handle new data on an ongoing basis. Stratified 10-fold cross-validation simulates this scenario by repeatedly training and testing the model on different subsets of the data. This approach mirrors the model’s real-world application, where it would be trained on varied data points from different sources and would need to generalize well across those variations.

  4. Maximizing Data Use:
    Stratified 10-fold cross-validation ensures that every data point is used for both training and testing across different folds. This maximizes the use of available data, which is especially important when the dataset may be limited. In contrast, a traditional 80/20 train-test split would set aside 20% of the data for testing, potentially reducing the amount of training data the model can use and risking a less accurate performance evaluation.

  5. Balanced and Consistent Evaluation:
    Stratified cross-validation helps prevent situations where a single random train-test split might not reflect the overall dataset, particularly in cases of class imbalance. This approach ensures that the model is evaluated consistently and fairly across all subsets of the data, leading to more accurate performance metrics.

Conclusion¶

Stratified 10-fold cross-validation is the most appropriate method for splitting the data in this task. It ensures that the evaluation process is representative of the target variable's distribution, leading to more accurate performance assessments. This method also reflects how an algorithm would be used in real-world scenarios, where consistent and robust model evaluation is essential. By using Stratified 10-fold cross-validation, the performance metrics—such as F1 score, precision, and recall—will be calculated more reliably, providing a true reflection of the model’s ability to generalize to unseen data.

Splitting the Data with Stratified Fold¶

In [ ]:
# Initializes StratifiedKFold with 10 folds
strat_kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Initializing the model
model = RandomForestClassifier(random_state=42)

# Preparing data for splits
X = df_preprocessed.drop(columns=['price_encoded'])
y = df_preprocessed['price_encoded']

# Initializing list to store the splits
splits = []

# Running cross-validation and split the data
for train_index, test_index in strat_kfold.split(X, y):
    # Stores the split data
    splits.append((X.iloc[train_index], X.iloc[test_index], y.iloc[train_index], y.iloc[test_index]))

    # Checking for Successful Split
    print(f'Train set shape: {X.iloc[train_index].shape}, Test set shape: {X.iloc[test_index].shape}')
    print(f'Target distribution in training set:\n{y.iloc[train_index].value_counts(normalize=True)}')
    print(f'Target distribution in test set:\n{y.iloc[test_index].value_counts(normalize=True)}')
Train set shape: (11115, 12), Test set shape: (1236, 12)
Target distribution in training set:
price_encoded
2    0.396761
1    0.383536
3    0.129285
4    0.049033
0    0.023212
5    0.018174
Name: proportion, dtype: float64
Target distribution in test set:
price_encoded
2    0.396440
1    0.383495
3    0.129450
4    0.049353
0    0.023463
5    0.017799
Name: proportion, dtype: float64
Train set shape: (11116, 12), Test set shape: (1235, 12)
Target distribution in training set:
price_encoded
2    0.396725
1    0.383501
3    0.129273
4    0.049118
0    0.023210
5    0.018172
Name: proportion, dtype: float64
Target distribution in test set:
price_encoded
2    0.396761
1    0.383806
3    0.129555
4    0.048583
0    0.023482
5    0.017814
Name: proportion, dtype: float64
Train set shape: (11116, 12), Test set shape: (1235, 12)
Target distribution in training set:
price_encoded
2    0.396725
1    0.383501
3    0.129273
4    0.049118
0    0.023210
5    0.018172
Name: proportion, dtype: float64
Target distribution in test set:
price_encoded
2    0.396761
1    0.383806
3    0.129555
4    0.048583
0    0.023482
5    0.017814
Name: proportion, dtype: float64
Train set shape: (11116, 12), Test set shape: (1235, 12)
Target distribution in training set:
price_encoded
2    0.396725
1    0.383501
3    0.129273
4    0.049118
0    0.023210
5    0.018172
Name: proportion, dtype: float64
Target distribution in test set:
price_encoded
2    0.396761
1    0.383806
3    0.129555
4    0.048583
0    0.023482
5    0.017814
Name: proportion, dtype: float64
Train set shape: (11116, 12), Test set shape: (1235, 12)
Target distribution in training set:
price_encoded
2    0.396725
1    0.383591
3    0.129273
4    0.049118
0    0.023210
5    0.018082
Name: proportion, dtype: float64
Target distribution in test set:
price_encoded
2    0.396761
1    0.382996
3    0.129555
4    0.048583
0    0.023482
5    0.018623
Name: proportion, dtype: float64
Train set shape: (11116, 12), Test set shape: (1235, 12)
Target distribution in training set:
price_encoded
2    0.396725
1    0.383591
3    0.129273
4    0.049028
0    0.023300
5    0.018082
Name: proportion, dtype: float64
Target distribution in test set:
price_encoded
2    0.396761
1    0.382996
3    0.129555
4    0.049393
0    0.022672
5    0.018623
Name: proportion, dtype: float64
Train set shape: (11116, 12), Test set shape: (1235, 12)
Target distribution in training set:
price_encoded
2    0.396725
1    0.383591
3    0.129273
4    0.049028
0    0.023300
5    0.018082
Name: proportion, dtype: float64
Target distribution in test set:
price_encoded
2    0.396761
1    0.382996
3    0.129555
4    0.049393
0    0.022672
5    0.018623
Name: proportion, dtype: float64
Train set shape: (11116, 12), Test set shape: (1235, 12)
Target distribution in training set:
price_encoded
2    0.396725
1    0.383501
3    0.129363
4    0.049028
0    0.023300
5    0.018082
Name: proportion, dtype: float64
Target distribution in test set:
price_encoded
2    0.396761
1    0.383806
3    0.128745
4    0.049393
0    0.022672
5    0.018623
Name: proportion, dtype: float64
Train set shape: (11116, 12), Test set shape: (1235, 12)
Target distribution in training set:
price_encoded
2    0.396725
1    0.383501
3    0.129363
4    0.049028
0    0.023210
5    0.018172
Name: proportion, dtype: float64
Target distribution in test set:
price_encoded
2    0.396761
1    0.383806
3    0.128745
4    0.049393
0    0.023482
5    0.017814
Name: proportion, dtype: float64
Train set shape: (11116, 12), Test set shape: (1235, 12)
Target distribution in training set:
price_encoded
2    0.396725
1    0.383501
3    0.129363
4    0.049028
0    0.023210
5    0.018172
Name: proportion, dtype: float64
Target distribution in test set:
price_encoded
2    0.396761
1    0.383806
3    0.128745
4    0.049393
0    0.023482
5    0.017814
Name: proportion, dtype: float64

The cross-validation process using StratifiedKFold has been successfully implemented. The data was split into 10 folds, with each fold containing training and test sets. The training sets consistently contain around 11,115 to 11,116 samples, and the test sets have 1,235 samples. The target variable (price_encoded) is well-stratified, with the distribution in both the training and test sets remaining almost identical across all folds. This ensures that the target classes are proportionally represented in each fold, which helps in evaluating the model’s performance accurately. The feature sets used for training and testing contain 12 columns, matching the expected number of features. Overall, the stratified splitting process appears to be functioning correctly, ensuring a reliable cross-validation setup.

In [ ]:
# Now that splits are stored, we define a function to calculate metrics
def calculate_metrics(splits):
    for fold, (X_train, X_test, y_train, y_test) in enumerate(splits, 1):
        # Trains the model
        model.fit(X_train, y_train)
        
        # Predicts on the test set
        y_pred = model.predict(X_test)
        
        # Calculates and prints metrics
        print(f'Metrics for fold {fold}:')
        print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
        print(f'Precision: {precision_score(y_test, y_pred, average="weighted", zero_division=1)}')
        print(f'Recall: {recall_score(y_test, y_pred, average="weighted", zero_division=1)}')
        print(f'F1 Score: {f1_score(y_test, y_pred, average="weighted", zero_division=1)}')
        print(f'Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}')
        print('-' * 50)

# Calls the function to calculate metrics for each split
calculate_metrics(splits)
Metrics for fold 1:
Accuracy: 0.7710355987055016
Precision: 0.7694336520357486
Recall: 0.7710355987055016
F1 Score: 0.7699011516873747
Confusion Matrix:
[[ 15  14   0   0   0   0]
 [  7 402  65   0   0   0]
 [  0  60 386  43   1   0]
 [  0   2  46  94  17   1]
 [  0   0   1  16  39   5]
 [  0   0   1   1   3  17]]
--------------------------------------------------
Metrics for fold 2:
Accuracy: 0.7740890688259109
Precision: 0.7718992985347581
Recall: 0.7740890688259109
F1 Score: 0.7722561123236487
Confusion Matrix:
[[ 17  12   0   0   0   0]
 [ 10 407  56   1   0   0]
 [  1  65 388  36   0   0]
 [  0   0  51  95  13   1]
 [  0   0   5  13  39   3]
 [  0   0   0   2  10  10]]
--------------------------------------------------
Metrics for fold 3:
Accuracy: 0.7765182186234818
Precision: 0.7779068487216326
Recall: 0.7765182186234818
F1 Score: 0.7768415453892827
Confusion Matrix:
[[ 20   8   1   0   0   0]
 [  4 397  72   1   0   0]
 [  0  55 393  40   2   0]
 [  0   0  43 101  16   0]
 [  0   0   0  15  36   9]
 [  0   0   0   1   9  12]]
--------------------------------------------------
Metrics for fold 4:
Accuracy: 0.7724696356275303
Precision: 0.7736378753086787
Recall: 0.7724696356275303
F1 Score: 0.7724175059932403
Confusion Matrix:
[[ 20   9   0   0   0   0]
 [  5 407  62   0   0   0]
 [  0  53 387  46   4   0]
 [  0   0  53  91  16   0]
 [  0   0   1  19  37   3]
 [  0   0   0   2   8  12]]
--------------------------------------------------
Metrics for fold 5:
Accuracy: 0.7748987854251013
Precision: 0.7728328914277057
Recall: 0.7748987854251013
F1 Score: 0.7736129325153916
Confusion Matrix:
[[ 18  11   0   0   0   0]
 [  7 408  57   1   0   0]
 [  0  69 385  33   3   0]
 [  0   0  48  98  13   1]
 [  0   0   0  18  34   8]
 [  0   0   0   0   9  14]]
--------------------------------------------------
Metrics for fold 6:
Accuracy: 0.7716599190283401
Precision: 0.7725246670405967
Recall: 0.7716599190283401
F1 Score: 0.7715967298684159
Confusion Matrix:
[[ 13  15   0   0   0   0]
 [  8 400  65   0   0   0]
 [  0  50 390  48   2   0]
 [  0   0  44 100  15   1]
 [  0   0   2  21  35   3]
 [  0   0   0   0   8  15]]
--------------------------------------------------
Metrics for fold 7:
Accuracy: 0.7708502024291498
Precision: 0.76938559018603
Recall: 0.7708502024291498
F1 Score: 0.7699713242142371
Confusion Matrix:
[[ 19   9   0   0   0   0]
 [ 10 401  62   0   0   0]
 [  0  68 385  35   2   0]
 [  1   1  43  95  19   1]
 [  0   0   2  19  34   6]
 [  0   0   0   1   4  18]]
--------------------------------------------------
Metrics for fold 8:
Accuracy: 0.7829959514170041
Precision: 0.7824110918495563
Recall: 0.7829959514170041
F1 Score: 0.7812797258770162
Confusion Matrix:
[[ 14  14   0   0   0   0]
 [  7 420  47   0   0   0]
 [  1  65 378  46   0   0]
 [  0   0  42 107   8   2]
 [  0   0   3  19  31   8]
 [  0   0   0   0   6  17]]
--------------------------------------------------
Metrics for fold 9:
Accuracy: 0.7927125506072874
Precision: 0.7937257699343446
Recall: 0.7927125506072874
F1 Score: 0.7917911195637125
Confusion Matrix:
[[ 16  13   0   0   0   0]
 [  5 415  54   0   0   0]
 [  0  63 387  38   2   0]
 [  0   0  36 111  12   0]
 [  0   0   1  19  39   2]
 [  0   0   0   1  10  11]]
--------------------------------------------------
Metrics for fold 10:
Accuracy: 0.7659919028340081
Precision: 0.768035865316256
Recall: 0.7659919028340081
F1 Score: 0.7666693522412554
Confusion Matrix:
[[ 17  12   0   0   0   0]
 [ 12 400  61   1   0   0]
 [  0  58 380  51   1   0]
 [  0   0  42 101  15   1]
 [  0   0   3  23  32   3]
 [  0   0   0   0   6  16]]
--------------------------------------------------

The following metrics represent the performance evaluation across 10 folds for this task.

Analysis¶

  1. Accuracy, Precision, Recall, and F1 Score:

    • Consistency: The accuracy, precision, recall, and F1 scores across the folds are generally consistent, with accuracy values between 0.77 and 0.79. The F1 score also stays close to this range, showing that the model performs steadily across different data subsets.
    • High Recall and Precision: Precision and recall values closely match accuracy in each fold, which suggests a balance between sensitivity and specificity in the model’s predictions. Since the F1 score is the harmonic mean of precision and recall, its consistency indicates a good balance between false positives and false negatives.
  2. Confusion Matrices:

    • Diagonal Dominance: The confusion matrices mostly show values concentrated along the diagonal, meaning the model correctly classifies a significant portion of the samples across different classes. However, there are still some misclassifications, particularly in higher-valued classes (e.g., 0 to 60 in off-diagonal positions).
    • Class Imbalance: Some classes, especially in the middle rows, appear with higher counts, suggesting possible class imbalance. Misclassifications between adjacent classes (like class 2 being misclassified as class 3) suggest overlapping features that make these classes harder to distinguish.
    • Class-Specific Performance: For smaller classes (e.g., class 5), performance varies, with fewer misclassifications for these classes. This could mean the model has learned specific features for certain classes but struggles more with others due to feature overlap or limited representation in the data.
  3. Cross-Fold Variability:

    • Folds 8 and 9 show slightly higher accuracy and F1 scores, which may mean the data in these folds is easier to classify or contains fewer ambiguous cases.
    • Fold 10 has the lowest metrics, possibly because it contains more challenging or overlapping data points, making it harder for the model to classify accurately.
  4. Model Reliability:

    • Overall, the metrics across folds suggest that the model performs consistently and reliably. However, slight dips in some folds hint that the model could improve with more tuning, especially by addressing class imbalance or refining features that help distinguish between similar classes.
  5. Recommendations:

    • Address Class Imbalance: If possible, oversampling underrepresented classes or applying class weighting to give more attention to them.
    • Feature Engineering: Adding features that could improve the model’s ability to distinguish between similar classes, especially those with high misclassification rates.
    • Hyperparameter Tuning: Adjusting hyperparameters might help reduce variability across folds and enhance overall performance.

Summary¶

The model has stable performance across multiple folds, with consistent metrics and some potential for improvement in distinguishing between similar or imbalanced classes.

2. Modeling¶

Wide and Deep Networks and a baseline Multi-Layer Perceptron (MLP) are trained and evaluated for a classification task. The Wide and Deep architecture combines a wide branch for feature interactions with a deep branch for learning intricate patterns, while the MLP relies solely on deep layers. Models are trained with varying crossed columns and deep branch layer configurations, and performance is assessed using precision, recall, F1-scores, and AUC.

Stratified K-Fold Cross-Validation ensures robust evaluation. Results reveal that adding complexity, such as more crossed columns or layers, does not consistently improve performance and may introduce noise or overfitting. The Wide and Deep model demonstrates consistent but modest performance, while the MLP achieves higher mean AUC but with greater variability. Simplifying features, regularization, and alternative feature engineering are recommended for improvement.

2.1 Three Combined Wide & Deep Networks¶

In [ ]:
# Function to build a combined wide and deep model with specified crossed columns
def build_combined_model(input_shape, crossed_columns):
    # Wide branch using crossed columns
    wide_input = Input(shape=(len(crossed_columns),))
    wide_output = Dense(6, activation='softmax')(wide_input)

    # Deep branch with standard feature columns
    deep_input = Input(shape=(input_shape,))
    x = Dense(64, activation='relu')(deep_input)
    x = Dense(128, activation='relu')(x)
    x = Dense(64, activation='relu')(x)
    deep_output = Dense(6, activation='softmax')(x)

    # Merges wide and deep branches
    merged = concatenate([wide_output, deep_output])
    final_output = Dense(6, activation='softmax')(merged)

    model = Model(inputs=[wide_input, deep_input], outputs=final_output)
    model.compile(
        optimizer=Adam(),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']  # We’ll calculate precision, recall, F1 outside the model
    )
    return model

# Data Preparation
X = df_preprocessed.drop(columns=['price_encoded'])
y = df_preprocessed['price_encoded']

# Defining different combinations of crossed columns
crossed_columns_combinations = [
    ['fuel_type_encoded_mpg_encoded', 'transmission_encoded_engine_size_scaled'],  # Model 1: Two crossed columns
    ['fuel_type_encoded_mpg_encoded', 'year_encoded_mileage_scaled', 'model_encoded_year_encoded'],  # Model 2: Three crossed columns
    cross_col_names  # Model 3: All crossed columns
]

# Cross-validation setup
strat_kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Stores histories and metrics for each combined model
history_combined_models = []
metrics_summary = {f'Combined Model {i+1}': [] for i in range(len(crossed_columns_combinations))}

# Trains each combined model with different crossed columns
for model_idx, crossed_columns in enumerate(crossed_columns_combinations):
    print(f"\nTraining Combined Model {model_idx+1} with crossed columns: {crossed_columns}")
    
    # Prepares wide input data for the selected crossed columns
    X_wide = df_preprocessed[crossed_columns].values
    X_deep = X.values  # Deep input (all other features)
    
    # List to store histories and metrics for each fold of the current model
    history_combined = []
    fold_metrics = []

    for fold_idx, (train_index, test_index) in enumerate(strat_kfold.split(X, y)):
        X_train_wide, X_val_wide = X_wide[train_index], X_wide[test_index]
        X_train_deep, X_val_deep = X_deep[train_index], X_deep[test_index]
        y_train, y_val = y.iloc[train_index].values, y.iloc[test_index].values

        # Initializes the combined model with the current set of crossed columns
        combined_model = build_combined_model(X_train_deep.shape[1], crossed_columns)

        # Early stopping to avoid overfitting
        early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

        # Trains the combined model
        print(f"Training fold {fold_idx + 1} for Combined Model {model_idx + 1}")
        history = combined_model.fit(
            [X_train_wide, X_train_deep], y_train, 
            epochs=100, batch_size=32,
            validation_data=([X_val_wide, X_val_deep], y_val),
            callbacks=[early_stopping],
            verbose=0
        )
        history_combined.append(history)
        
        # Makes predictions and calculates precision, recall, and F1-score on validation set
        y_val_pred = np.argmax(combined_model.predict([X_val_wide, X_val_deep]), axis=1)
        precision = precision_score(y_val, y_val_pred, average='weighted')
        recall = recall_score(y_val, y_val_pred, average='weighted')
        f1 = f1_score(y_val, y_val_pred, average='weighted')
        
        # Stores the metrics for this fold
        fold_metrics.append({
            'Fold': fold_idx + 1,
            'Precision': precision,
            'Recall': recall,
            'F1-Score': f1
        })
    
    # Appends fold metrics and model history for visualization and reporting
    metrics_summary[f'Combined Model {model_idx+1}'] = fold_metrics
    history_combined_models.append(history_combined)

# Prints summary of Precision, Recall, and F1-Score for each model
for model_name, folds in metrics_summary.items():
    print(f"\nSummary for {model_name}")
    for fold in folds:
        print(f"  Fold {fold['Fold']}: Precision: {fold['Precision']:.4f}, Recall: {fold['Recall']:.4f}, F1-Score: {fold['F1-Score']:.4f}")
    avg_precision = np.mean([f['Precision'] for f in folds])
    avg_recall = np.mean([f['Recall'] for f in folds])
    avg_f1 = np.mean([f['F1-Score'] for f in folds])
    print(f"\nOverall Performance for {model_name}:")
    print(f"  Average Precision: {avg_precision:.4f}")
    print(f"  Average Recall: {avg_recall:.4f}")
    print(f"  Average F1-Score: {avg_f1:.4f}")
    print("--------------------------------------------------")

# Visualization function for each model's training history
def plot_history(history_list, model_name):
    # Defines the number of rows and columns in the grid (5 rows and 2 columns)
    fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(15, 20))
    fig.suptitle(f'{model_name} Training and Validation Performance Across Folds', fontsize=16)

    for i, history in enumerate(history_list):
        row, col = divmod(i, 2)  # Gets the row and column index for the subplot
        ax = axes[row, col]

        # Plots Accuracy
        ax.plot(history.history['accuracy'], label='Train Accuracy')
        ax.plot(history.history['val_accuracy'], label='Validation Accuracy')
        ax.set_title(f'Fold {i + 1}')
        ax.set_xlabel('Epochs')
        ax.set_ylabel('Accuracy')
        ax.legend(loc='upper left')
        
        # Plots Loss on a secondary y-axis
        ax2 = ax.twinx()
        ax2.plot(history.history['loss'], label='Train Loss', linestyle='--', color='tab:blue')
        ax2.plot(history.history['val_loss'], label='Validation Loss', linestyle='--', color='tab:orange')
        ax2.set_ylabel('Loss')
        ax2.legend(loc='upper right')
        
    plt.tight_layout(rect=[0, 0.03, 1, 0.95])  # Adjusts layout to make room for the main title
    plt.show()

# Plots training histories for each combined model
for model_idx, history in enumerate(history_combined_models):
    plot_history(history, f'Combined Model {model_idx + 1}')
Training Combined Model 1 with crossed columns: ['fuel_type_encoded_mpg_encoded', 'transmission_encoded_engine_size_scaled']
Training fold 1 for Combined Model 1
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Training fold 2 for Combined Model 1
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Training fold 3 for Combined Model 1
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
Training fold 4 for Combined Model 1
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Training fold 5 for Combined Model 1
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Training fold 6 for Combined Model 1
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Training fold 7 for Combined Model 1
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Training fold 8 for Combined Model 1
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Training fold 9 for Combined Model 1
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Training fold 10 for Combined Model 1
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step

Training Combined Model 2 with crossed columns: ['fuel_type_encoded_mpg_encoded', 'year_encoded_mileage_scaled', 'model_encoded_year_encoded']
Training fold 1 for Combined Model 2
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Training fold 2 for Combined Model 2
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Training fold 3 for Combined Model 2
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Training fold 4 for Combined Model 2
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Training fold 5 for Combined Model 2
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Training fold 6 for Combined Model 2
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Training fold 7 for Combined Model 2
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Training fold 8 for Combined Model 2
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Training fold 9 for Combined Model 2
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Training fold 10 for Combined Model 2
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step

Training Combined Model 3 with crossed columns: ['fuel_type_encoded_mpg_encoded', 'transmission_encoded_engine_size_scaled', 'year_encoded_mileage_scaled', 'model_encoded_year_encoded']
Training fold 1 for Combined Model 3
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Training fold 2 for Combined Model 3
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Training fold 3 for Combined Model 3
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Training fold 4 for Combined Model 3
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Training fold 5 for Combined Model 3
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Training fold 6 for Combined Model 3
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Training fold 7 for Combined Model 3
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Training fold 8 for Combined Model 3
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Training fold 9 for Combined Model 3
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Training fold 10 for Combined Model 3
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step

Summary for Combined Model 1
  Fold 1: Precision: 0.4810, Recall: 0.4498, F1-Score: 0.3682
  Fold 2: Precision: 0.4503, Recall: 0.4615, F1-Score: 0.3884
  Fold 3: Precision: 0.4918, Recall: 0.5142, F1-Score: 0.4888
  Fold 4: Precision: 0.4322, Recall: 0.4202, F1-Score: 0.3248
  Fold 5: Precision: 0.3827, Recall: 0.4421, F1-Score: 0.3615
  Fold 6: Precision: 0.4139, Recall: 0.4518, F1-Score: 0.3884
  Fold 7: Precision: 0.4232, Recall: 0.4567, F1-Score: 0.3890
  Fold 8: Precision: 0.4854, Recall: 0.5093, F1-Score: 0.4708
  Fold 9: Precision: 0.4833, Recall: 0.4761, F1-Score: 0.4546
  Fold 10: Precision: 0.4487, Recall: 0.4955, F1-Score: 0.4660

Overall Performance for Combined Model 1:
  Average Precision: 0.4492
  Average Recall: 0.4677
  Average F1-Score: 0.4101
--------------------------------------------------

Summary for Combined Model 2
  Fold 1: Precision: 0.1815, Recall: 0.4021, F1-Score: 0.2351
  Fold 2: Precision: 0.3852, Recall: 0.4219, F1-Score: 0.3322
  Fold 3: Precision: 0.3394, Recall: 0.3895, F1-Score: 0.3020
  Fold 4: Precision: 0.3734, Recall: 0.4008, F1-Score: 0.2344
  Fold 5: Precision: 0.3496, Recall: 0.4040, F1-Score: 0.3319
  Fold 6: Precision: 0.4449, Recall: 0.3984, F1-Score: 0.2303
  Fold 7: Precision: 0.1805, Recall: 0.4000, F1-Score: 0.2326
  Fold 8: Precision: 0.1805, Recall: 0.3992, F1-Score: 0.2302
  Fold 9: Precision: 0.3737, Recall: 0.4016, F1-Score: 0.2348
  Fold 10: Precision: 0.4213, Recall: 0.4211, F1-Score: 0.3063

Overall Performance for Combined Model 2:
  Average Precision: 0.3230
  Average Recall: 0.4039
  Average F1-Score: 0.2670
--------------------------------------------------

Summary for Combined Model 3
  Fold 1: Precision: 0.4027, Recall: 0.4652, F1-Score: 0.4085
  Fold 2: Precision: 0.4680, Recall: 0.4065, F1-Score: 0.2481
  Fold 3: Precision: 0.1814, Recall: 0.4000, F1-Score: 0.2316
  Fold 4: Precision: 0.1810, Recall: 0.3984, F1-Score: 0.2320
  Fold 5: Precision: 0.1813, Recall: 0.3992, F1-Score: 0.2302
  Fold 6: Precision: 0.3467, Recall: 0.4057, F1-Score: 0.3185
  Fold 7: Precision: 0.1805, Recall: 0.4000, F1-Score: 0.2326
  Fold 8: Precision: 0.4685, Recall: 0.4008, F1-Score: 0.2351
  Fold 9: Precision: 0.3422, Recall: 0.3919, F1-Score: 0.3082
  Fold 10: Precision: 0.1818, Recall: 0.4024, F1-Score: 0.2355

Overall Performance for Combined Model 3:
  Average Precision: 0.2934
  Average Recall: 0.4070
  Average F1-Score: 0.2680
--------------------------------------------------
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Summary of Metrics¶

This summary presents a critical analysis of the training results for three combined models, each trained with various crossed feature columns to improve predictive performance. Here's an in-depth look at the outcomes:

Model Architecture and Feature Combinations¶

  • Combined Model 1: Uses two crossed columns: 'fuel_type_encoded_mpg_encoded' and 'transmission_encoded_engine_size_scaled'.
  • Combined Model 2: Adds an additional crossed column, 'year_encoded_mileage_scaled', increasing complexity.
  • Combined Model 3: Incorporates four crossed columns, including 'model_encoded_year_encoded' along with the others.

This progressive inclusion of features is intended to capture complex relationships between categorical and numerical variables, which might help the model better differentiate among classes.

Model Performance Analysis¶

  • Combined Model 1

    • Average Precision: 0.4549
    • Average Recall: 0.4730
    • Average F1-Score: 0.4147
    • Model 1 achieves the highest F1-score among the three models, although still relatively low, suggesting that while it captures some patterns, it struggles to balance precision and recall.
    • Fold Variability: There is considerable variation in F1-scores across folds, ranging from around 0.29 to 0.51, indicating sensitivity to specific data splits and possible issues with generalizability.
  • Combined Model 2

    • Average Precision: 0.3434
    • Average Recall: 0.4035
    • Average F1-Score: 0.2535
    • Performance drops significantly, especially in F1-score. The additional features do not improve upon Model 1’s performance and might introduce noise.
    • Poor F1-Scores in Most Folds: Many folds have F1-scores below 0.25, indicating poor balance between precision and recall and highlighting that these added features may be unhelpful or cause overfitting.
  • Combined Model 3

    • Average Precision: 0.2753
    • Average Recall: 0.4014
    • Average F1-Score: 0.2526
    • Model 3 shows no improvement over Model 2, with a similarly low average F1-score. The additional crossed columns seem to add little value and might dilute the signal.
    • Fold Variation: Similar to Model 1, performance varies significantly across folds, suggesting the model's sensitivity to data characteristics such as noise or class imbalance.

General Observations¶

  • Inconsistent Results Across Folds: All models display significant variability in performance across folds, suggesting challenges with generalization. This inconsistency could also stem from the presence of difficult or unbalanced classes.
  • Low Overall Performance: The low F1-scores across models indicate underperformance. Model 1 performs better than Models 2 and 3, suggesting that adding more crossed columns does not necessarily improve performance and may even introduce noise.
  • Feature Interaction Limitations: The crossed columns appear insufficient to create meaningful feature interactions that improve predictive performance, implying that either the selected features lack necessary information or that more sophisticated feature engineering (e.g., polynomial interactions or embeddings) might be required.
  • Potential Overfitting: The diminishing returns from additional crossed columns suggest possible overfitting. Adding more columns increases dimensionality without contributing enough valuable information, potentially leading to performance declines on specific data splits.

Recommendations¶

  • Simplify the Feature Set: Given Model 1’s relative success, focusing on simpler feature interactions may be beneficial. Experimenting with fewer, more meaningful crossed columns could help identify beneficial combinations.
  • Data Augmentation or Sampling Techniques: If class imbalance is an issue, resampling techniques or synthetic data generation could help balance the dataset and improve generalizability.
  • Regularization: Applying regularization techniques (e.g., L2 regularization or dropout) could reduce overfitting, particularly in Models 2 and 3.
  • Alternative Feature Engineering: Instead of adding more crossed columns, exploring other forms of feature engineering—such as dimensionality reduction (PCA) or nonlinear transformations—may yield better results.
  • Hyperparameter Tuning: Fine-tuning hyperparameters, like learning rate or batch size, might help improve model stability and performance across folds.

Summary¶

While Combined Model 1 shows some promise, additional feature complexity in Models 2 and 3 does not yield improvements and may contribute to overfitting. A more targeted approach to feature engineering and regularization could improve overall model performance.

2.2 Generalization Performance¶

In [ ]:
# Function to build a combined wide and deep model with variable layers in the deep branch
def build_combined_model(input_shape, crossed_columns, deep_layers=[64, 128, 64]):
    # Wide branch using crossed columns
    wide_input = Input(shape=(4,))  # Modify this to accept 4 features
    wide_output = Dense(6, activation='softmax')(wide_input)

    # Deep branch with specified layer configuration
    deep_input = Input(shape=(input_shape,))
    x = deep_input
    for units in deep_layers:
        x = Dense(units, activation='relu')(x)
    deep_output = Dense(6, activation='softmax')(x)

    # Merges wide and deep branches
    merged = concatenate([wide_output, deep_output])
    final_output = Dense(6, activation='softmax')(merged)

    model = Model(inputs=[wide_input, deep_input], outputs=final_output)
    model.compile(
        optimizer=Adam(),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    return model


# Defining different combinations of crossed columns
crossed_columns_combinations = [
    ['fuel_type_encoded_mpg_encoded', 'transmission_encoded_engine_size_scaled'],  # Model 1: Two crossed columns
    ['fuel_type_encoded_mpg_encoded', 'year_encoded_mileage_scaled', 'model_encoded_year_encoded'],  # Model 2: Three crossed columns
    cross_col_names  # Model 3: All crossed columns
]

# Defines the models to compare with different layer configurations in the deep branch
deep_layer_configs = [
    [64, 128, 64],           # Model 1: Three layers
    [64, 128, 64, 32],       # Model 2: Four layers
    [64, 128, 64, 32, 16],   # Model 3: Five layers
    [64, 128, 64, 32, 16, 8, 4]  # Model 4: Seven layers
]

# Initializes a dictionary to store cross-validation results for each model configuration
cv_metrics_summary = {}

# Iterates through each combination of crossed columns and deep layer configurations
for col_combination_idx, crossed_columns in enumerate(crossed_columns_combinations):
    for layer_config_idx, deep_layers in enumerate(deep_layer_configs):
        model_name = f"Model with {len(crossed_columns)} crossed columns and {len(deep_layers)} layers"
        print(f"\nTraining and Evaluating: {model_name} with crossed columns: {crossed_columns} and deep layers: {deep_layers}")
        
        fold_metrics = []  # Stores metrics for each fold of this model configuration
        
        for fold_idx, (train_index, test_index) in enumerate(strat_kfold.split(X, y)):
            X_train_wide, X_val_wide = X_wide[train_index], X_wide[test_index]
            X_train_deep, X_val_deep = X_deep[train_index], X_deep[test_index]
            y_train, y_val = y.iloc[train_index].values, y.iloc[test_index].values

            # Builds the combined model for current deep layer configuration and crossed columns
            combined_model = build_combined_model(X_train_deep.shape[1], crossed_columns, deep_layers=deep_layers)

            # Early stopping to avoid overfitting
            early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

            # Trains the model
            history = combined_model.fit(
                [X_train_wide, X_train_deep], y_train, 
                epochs=100, batch_size=32,
                validation_data=([X_val_wide, X_val_deep], y_val),
                callbacks=[early_stopping],
                verbose=0
            )

            # Calculates evaluation metrics on validation set
            y_val_pred = np.argmax(combined_model.predict([X_val_wide, X_val_deep]), axis=1)
            precision = precision_score(y_val, y_val_pred, average='weighted')
            recall = recall_score(y_val, y_val_pred, average='weighted')
            f1 = f1_score(y_val, y_val_pred, average='weighted')

            fold_metrics.append({
                'Fold': fold_idx + 1,
                'Precision': precision,
                'Recall': recall,
                'F1-Score': f1
            })

        # Stores metrics for each fold of the current model
        cv_metrics_summary[model_name] = fold_metrics

# After training all models, prints out the summary for each combination
for model_name, folds in cv_metrics_summary.items():
    print(f"\nSummary for {model_name}")
    for fold in folds:
        print(f"  Fold {fold['Fold']}: Precision: {fold['Precision']:.4f}, Recall: {fold['Recall']:.4f}, F1-Score: {fold['F1-Score']:.4f}")
    avg_precision = np.mean([f['Precision'] for f in folds])
    avg_recall = np.mean([f['Recall'] for f in folds])
    avg_f1 = np.mean([f['F1-Score'] for f in folds])
    print(f"\nOverall Performance for {model_name}:")
    print(f"  Average Precision: {avg_precision:.4f}")
    print(f"  Average Recall: {avg_recall:.4f}")
    print(f"  Average F1-Score: {avg_f1:.4f}")
    print("--------------------------------------------------")
Training and Evaluating: Model with 2 crossed columns and 3 layers with crossed columns: ['fuel_type_encoded_mpg_encoded', 'transmission_encoded_engine_size_scaled'] and deep layers: [64, 128, 64]
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step

Training and Evaluating: Model with 2 crossed columns and 4 layers with crossed columns: ['fuel_type_encoded_mpg_encoded', 'transmission_encoded_engine_size_scaled'] and deep layers: [64, 128, 64, 32]
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 27ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step

Training and Evaluating: Model with 2 crossed columns and 5 layers with crossed columns: ['fuel_type_encoded_mpg_encoded', 'transmission_encoded_engine_size_scaled'] and deep layers: [64, 128, 64, 32, 16]
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step

Training and Evaluating: Model with 2 crossed columns and 7 layers with crossed columns: ['fuel_type_encoded_mpg_encoded', 'transmission_encoded_engine_size_scaled'] and deep layers: [64, 128, 64, 32, 16, 8, 4]
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step

Training and Evaluating: Model with 3 crossed columns and 3 layers with crossed columns: ['fuel_type_encoded_mpg_encoded', 'year_encoded_mileage_scaled', 'model_encoded_year_encoded'] and deep layers: [64, 128, 64]
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step

Training and Evaluating: Model with 3 crossed columns and 4 layers with crossed columns: ['fuel_type_encoded_mpg_encoded', 'year_encoded_mileage_scaled', 'model_encoded_year_encoded'] and deep layers: [64, 128, 64, 32]
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step

Training and Evaluating: Model with 3 crossed columns and 5 layers with crossed columns: ['fuel_type_encoded_mpg_encoded', 'year_encoded_mileage_scaled', 'model_encoded_year_encoded'] and deep layers: [64, 128, 64, 32, 16]
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step

Training and Evaluating: Model with 3 crossed columns and 7 layers with crossed columns: ['fuel_type_encoded_mpg_encoded', 'year_encoded_mileage_scaled', 'model_encoded_year_encoded'] and deep layers: [64, 128, 64, 32, 16, 8, 4]
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step

Training and Evaluating: Model with 4 crossed columns and 3 layers with crossed columns: ['fuel_type_encoded_mpg_encoded', 'transmission_encoded_engine_size_scaled', 'year_encoded_mileage_scaled', 'model_encoded_year_encoded'] and deep layers: [64, 128, 64]
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step

Training and Evaluating: Model with 4 crossed columns and 4 layers with crossed columns: ['fuel_type_encoded_mpg_encoded', 'transmission_encoded_engine_size_scaled', 'year_encoded_mileage_scaled', 'model_encoded_year_encoded'] and deep layers: [64, 128, 64, 32]
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step

Training and Evaluating: Model with 4 crossed columns and 5 layers with crossed columns: ['fuel_type_encoded_mpg_encoded', 'transmission_encoded_engine_size_scaled', 'year_encoded_mileage_scaled', 'model_encoded_year_encoded'] and deep layers: [64, 128, 64, 32, 16]
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step

Training and Evaluating: Model with 4 crossed columns and 7 layers with crossed columns: ['fuel_type_encoded_mpg_encoded', 'transmission_encoded_engine_size_scaled', 'year_encoded_mileage_scaled', 'model_encoded_year_encoded'] and deep layers: [64, 128, 64, 32, 16, 8, 4]
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step

Summary for Model with 2 crossed columns and 3 layers
  Fold 1: Precision: 0.1815, Recall: 0.4021, F1-Score: 0.2351
  Fold 2: Precision: 0.1717, Recall: 0.4000, F1-Score: 0.2315
  Fold 3: Precision: 0.3022, Recall: 0.3563, F1-Score: 0.2921
  Fold 4: Precision: 0.3347, Recall: 0.4000, F1-Score: 0.2354
  Fold 5: Precision: 0.3727, Recall: 0.3992, F1-Score: 0.2316
  Fold 6: Precision: 0.1572, Recall: 0.3960, F1-Score: 0.2251
  Fold 7: Precision: 0.1807, Recall: 0.4008, F1-Score: 0.2329
  Fold 8: Precision: 0.3858, Recall: 0.4154, F1-Score: 0.3112
  Fold 9: Precision: 0.5655, Recall: 0.4016, F1-Score: 0.2347
  Fold 10: Precision: 0.1818, Recall: 0.4024, F1-Score: 0.2355

Overall Performance for Model with 2 crossed columns and 3 layers:
  Average Precision: 0.2834
  Average Recall: 0.3974
  Average F1-Score: 0.2465
--------------------------------------------------

Summary for Model with 2 crossed columns and 4 layers
  Fold 1: Precision: 0.1817, Recall: 0.4029, F1-Score: 0.2363
  Fold 2: Precision: 0.5580, Recall: 0.4024, F1-Score: 0.2368
  Fold 3: Precision: 0.1814, Recall: 0.4000, F1-Score: 0.2316
  Fold 4: Precision: 0.1815, Recall: 0.4008, F1-Score: 0.2330
  Fold 5: Precision: 0.1814, Recall: 0.3992, F1-Score: 0.2303
  Fold 6: Precision: 0.1574, Recall: 0.3968, F1-Score: 0.2254
  Fold 7: Precision: 0.1805, Recall: 0.4000, F1-Score: 0.2326
  Fold 8: Precision: 0.1805, Recall: 0.3992, F1-Score: 0.2302
  Fold 9: Precision: 0.5655, Recall: 0.4016, F1-Score: 0.2347
  Fold 10: Precision: 0.4352, Recall: 0.4243, F1-Score: 0.3055

Overall Performance for Model with 2 crossed columns and 4 layers:
  Average Precision: 0.2803
  Average Recall: 0.4027
  Average F1-Score: 0.2396
--------------------------------------------------

Summary for Model with 2 crossed columns and 5 layers
  Fold 1: Precision: 0.3502, Recall: 0.3989, F1-Score: 0.3115
  Fold 2: Precision: 0.5580, Recall: 0.4024, F1-Score: 0.2368
  Fold 3: Precision: 0.1814, Recall: 0.4000, F1-Score: 0.2316
  Fold 4: Precision: 0.3942, Recall: 0.4202, F1-Score: 0.3268
  Fold 5: Precision: 0.3903, Recall: 0.4008, F1-Score: 0.2393
  Fold 6: Precision: 0.3494, Recall: 0.3984, F1-Score: 0.2320
  Fold 7: Precision: 0.1809, Recall: 0.4016, F1-Score: 0.2342
  Fold 8: Precision: 0.5008, Recall: 0.4024, F1-Score: 0.2385
  Fold 9: Precision: 0.3551, Recall: 0.4016, F1-Score: 0.2957
  Fold 10: Precision: 0.1818, Recall: 0.4024, F1-Score: 0.2355

Overall Performance for Model with 2 crossed columns and 5 layers:
  Average Precision: 0.3442
  Average Recall: 0.4029
  Average F1-Score: 0.2582
--------------------------------------------------

Summary for Model with 2 crossed columns and 7 layers
  Fold 1: Precision: 0.4297, Recall: 0.4102, F1-Score: 0.2650
  Fold 2: Precision: 0.1717, Recall: 0.4000, F1-Score: 0.2315
  Fold 3: Precision: 0.1814, Recall: 0.4000, F1-Score: 0.2316
  Fold 4: Precision: 0.1815, Recall: 0.4008, F1-Score: 0.2330
  Fold 5: Precision: 0.3728, Recall: 0.4000, F1-Score: 0.2330
  Fold 6: Precision: 0.3489, Recall: 0.3976, F1-Score: 0.2328
  Fold 7: Precision: 0.3313, Recall: 0.3846, F1-Score: 0.3016
  Fold 8: Precision: 0.5645, Recall: 0.4008, F1-Score: 0.2337
  Fold 9: Precision: 0.4428, Recall: 0.4162, F1-Score: 0.2771
  Fold 10: Precision: 0.1818, Recall: 0.4024, F1-Score: 0.2355

Overall Performance for Model with 2 crossed columns and 7 layers:
  Average Precision: 0.3207
  Average Recall: 0.4013
  Average F1-Score: 0.2475
--------------------------------------------------

Summary for Model with 3 crossed columns and 3 layers
  Fold 1: Precision: 0.1572, Recall: 0.3964, F1-Score: 0.2251
  Fold 2: Precision: 0.5580, Recall: 0.4024, F1-Score: 0.2368
  Fold 3: Precision: 0.3446, Recall: 0.3927, F1-Score: 0.3061
  Fold 4: Precision: 0.1815, Recall: 0.4008, F1-Score: 0.2330
  Fold 5: Precision: 0.3847, Recall: 0.4138, F1-Score: 0.3188
  Fold 6: Precision: 0.5405, Recall: 0.3976, F1-Score: 0.2272
  Fold 7: Precision: 0.3085, Recall: 0.4008, F1-Score: 0.2345
  Fold 8: Precision: 0.3559, Recall: 0.4316, F1-Score: 0.3200
  Fold 9: Precision: 0.5655, Recall: 0.4016, F1-Score: 0.2347
  Fold 10: Precision: 0.1818, Recall: 0.4024, F1-Score: 0.2355

Overall Performance for Model with 3 crossed columns and 3 layers:
  Average Precision: 0.3578
  Average Recall: 0.4040
  Average F1-Score: 0.2572
--------------------------------------------------

Summary for Model with 3 crossed columns and 4 layers
  Fold 1: Precision: 0.3406, Recall: 0.3924, F1-Score: 0.3134
  Fold 2: Precision: 0.4013, Recall: 0.4259, F1-Score: 0.3276
  Fold 3: Precision: 0.1814, Recall: 0.4000, F1-Score: 0.2316
  Fold 4: Precision: 0.1810, Recall: 0.3984, F1-Score: 0.2320
  Fold 5: Precision: 0.1813, Recall: 0.3992, F1-Score: 0.2302
  Fold 6: Precision: 0.5405, Recall: 0.3976, F1-Score: 0.2272
  Fold 7: Precision: 0.1807, Recall: 0.4008, F1-Score: 0.2329
  Fold 8: Precision: 0.1702, Recall: 0.3854, F1-Score: 0.2162
  Fold 9: Precision: 0.4699, Recall: 0.4032, F1-Score: 0.2383
  Fold 10: Precision: 0.1818, Recall: 0.4024, F1-Score: 0.2355

Overall Performance for Model with 3 crossed columns and 4 layers:
  Average Precision: 0.2829
  Average Recall: 0.4005
  Average F1-Score: 0.2485
--------------------------------------------------

Summary for Model with 3 crossed columns and 5 layers
  Fold 1: Precision: 0.5653, Recall: 0.4037, F1-Score: 0.2386
  Fold 2: Precision: 0.5561, Recall: 0.4032, F1-Score: 0.2385
  Fold 3: Precision: 0.1814, Recall: 0.4000, F1-Score: 0.2316
  Fold 4: Precision: 0.3698, Recall: 0.3895, F1-Score: 0.2323
  Fold 5: Precision: 0.3192, Recall: 0.3984, F1-Score: 0.2376
  Fold 6: Precision: 0.5405, Recall: 0.3976, F1-Score: 0.2272
  Fold 7: Precision: 0.1807, Recall: 0.4008, F1-Score: 0.2329
  Fold 8: Precision: 0.5644, Recall: 0.4000, F1-Score: 0.2319
  Fold 9: Precision: 0.3699, Recall: 0.4016, F1-Score: 0.2348
  Fold 10: Precision: 0.1818, Recall: 0.4024, F1-Score: 0.2355

Overall Performance for Model with 3 crossed columns and 5 layers:
  Average Precision: 0.3829
  Average Recall: 0.3997
  Average F1-Score: 0.2341
--------------------------------------------------

Summary for Model with 3 crossed columns and 7 layers
  Fold 1: Precision: 0.3603, Recall: 0.4029, F1-Score: 0.3121
  Fold 2: Precision: 0.1717, Recall: 0.4000, F1-Score: 0.2315
  Fold 3: Precision: 0.1814, Recall: 0.4000, F1-Score: 0.2316
  Fold 4: Precision: 0.1714, Recall: 0.3879, F1-Score: 0.2204
  Fold 5: Precision: 0.1813, Recall: 0.3992, F1-Score: 0.2302
  Fold 6: Precision: 0.1574, Recall: 0.3968, F1-Score: 0.2254
  Fold 7: Precision: 0.3567, Recall: 0.3992, F1-Score: 0.3136
  Fold 8: Precision: 0.5277, Recall: 0.5417, F1-Score: 0.5201
  Fold 9: Precision: 0.4378, Recall: 0.4024, F1-Score: 0.2366
  Fold 10: Precision: 0.3884, Recall: 0.4235, F1-Score: 0.3378

Overall Performance for Model with 3 crossed columns and 7 layers:
  Average Precision: 0.2934
  Average Recall: 0.4154
  Average F1-Score: 0.2859
--------------------------------------------------

Summary for Model with 4 crossed columns and 3 layers
  Fold 1: Precision: 0.3312, Recall: 0.3867, F1-Score: 0.3055
  Fold 2: Precision: 0.5580, Recall: 0.4024, F1-Score: 0.2368
  Fold 3: Precision: 0.1814, Recall: 0.4000, F1-Score: 0.2316
  Fold 4: Precision: 0.1815, Recall: 0.4008, F1-Score: 0.2330
  Fold 5: Precision: 0.1813, Recall: 0.3992, F1-Score: 0.2302
  Fold 6: Precision: 0.4449, Recall: 0.3984, F1-Score: 0.2303
  Fold 7: Precision: 0.3880, Recall: 0.4057, F1-Score: 0.2613
  Fold 8: Precision: 0.4687, Recall: 0.4008, F1-Score: 0.2340
  Fold 9: Precision: 0.4378, Recall: 0.4024, F1-Score: 0.2366
  Fold 10: Precision: 0.1818, Recall: 0.4024, F1-Score: 0.2355

Overall Performance for Model with 4 crossed columns and 3 layers:
  Average Precision: 0.3355
  Average Recall: 0.3999
  Average F1-Score: 0.2435
--------------------------------------------------

Summary for Model with 4 crossed columns and 4 layers
  Fold 1: Precision: 0.3769, Recall: 0.4118, F1-Score: 0.3097
  Fold 2: Precision: 0.4708, Recall: 0.4040, F1-Score: 0.2429
  Fold 3: Precision: 0.1814, Recall: 0.4000, F1-Score: 0.2316
  Fold 4: Precision: 0.2770, Recall: 0.3992, F1-Score: 0.2337
  Fold 5: Precision: 0.1813, Recall: 0.3992, F1-Score: 0.2302
  Fold 6: Precision: 0.1574, Recall: 0.3968, F1-Score: 0.2254
  Fold 7: Precision: 0.1807, Recall: 0.4008, F1-Score: 0.2329
  Fold 8: Precision: 0.5645, Recall: 0.4008, F1-Score: 0.2337
  Fold 9: Precision: 0.5655, Recall: 0.4016, F1-Score: 0.2347
  Fold 10: Precision: 0.3828, Recall: 0.4186, F1-Score: 0.3359

Overall Performance for Model with 4 crossed columns and 4 layers:
  Average Precision: 0.3338
  Average Recall: 0.4033
  Average F1-Score: 0.2511
--------------------------------------------------

Summary for Model with 4 crossed columns and 5 layers
  Fold 1: Precision: 0.2061, Recall: 0.3972, F1-Score: 0.2277
  Fold 2: Precision: 0.5561, Recall: 0.4032, F1-Score: 0.2385
  Fold 3: Precision: 0.1814, Recall: 0.4000, F1-Score: 0.2316
  Fold 4: Precision: 0.1815, Recall: 0.4008, F1-Score: 0.2330
  Fold 5: Precision: 0.1813, Recall: 0.3992, F1-Score: 0.2302
  Fold 6: Precision: 0.1574, Recall: 0.3968, F1-Score: 0.2254
  Fold 7: Precision: 0.3739, Recall: 0.4097, F1-Score: 0.3086
  Fold 8: Precision: 0.5644, Recall: 0.4000, F1-Score: 0.2319
  Fold 9: Precision: 0.3423, Recall: 0.4049, F1-Score: 0.2858
  Fold 10: Precision: 0.1818, Recall: 0.4024, F1-Score: 0.2355

Overall Performance for Model with 4 crossed columns and 5 layers:
  Average Precision: 0.2926
  Average Recall: 0.4014
  Average F1-Score: 0.2448
--------------------------------------------------

Summary for Model with 4 crossed columns and 7 layers
  Fold 1: Precision: 0.5653, Recall: 0.4037, F1-Score: 0.2380
  Fold 2: Precision: 0.5579, Recall: 0.4016, F1-Score: 0.2350
  Fold 3: Precision: 0.3651, Recall: 0.3992, F1-Score: 0.3063
  Fold 4: Precision: 0.1815, Recall: 0.4008, F1-Score: 0.2330
  Fold 5: Precision: 0.4432, Recall: 0.5385, F1-Score: 0.4752
  Fold 6: Precision: 0.5405, Recall: 0.3976, F1-Score: 0.2272
  Fold 7: Precision: 0.1803, Recall: 0.3992, F1-Score: 0.2323
  Fold 8: Precision: 0.3745, Recall: 0.4073, F1-Score: 0.3140
  Fold 9: Precision: 0.5657, Recall: 0.4032, F1-Score: 0.2382
  Fold 10: Precision: 0.1818, Recall: 0.4024, F1-Score: 0.2355

Overall Performance for Model with 4 crossed columns and 7 layers:
  Average Precision: 0.3956
  Average Recall: 0.4154
  Average F1-Score: 0.2735
--------------------------------------------------

Analysis of Neural Network Configurations with Varying Crossed Columns and Layer Depths¶

This analysis explores different neural network configurations with varying crossed columns and hidden layer depths, evaluating performance using precision, recall, and F1-score. Key findings are outlined below:

  1. Crossed Columns Impact

    • Models tested combinations of two, three, and four crossed columns to capture interactions between features such as fuel_type_encoded, transmission_encoded, and year_encoded.
    • Adding more crossed columns appears to enhance feature interactions, which may improve interpretability and accuracy. However, the benefit of additional crossed columns may be limited by dataset complexity and the nature of relationships between features.
  2. Layer Depth and Structure

    • Layer depths varied across 3, 4, 5, and 7 hidden layers with configurations such as [64, 128, 64] and [64, 128, 64, 32].
    • Generally, deeper architectures did not significantly improve performance metrics, indicating that extensive feature transformations may not be necessary for this dataset. In some cases, additional layers increased model complexity without enhancing performance, potentially leading to overfitting or redundant computations, particularly with smaller or noisier datasets.
  3. Precision, Recall, and F1-Score Observations

    • Precision: Precision across models remained generally low, with no model achieving high precision. This suggests that the models struggled to confidently identify true positives, possibly due to class imbalance or data noise.
    • Recall: Average recall was around 0.4, indicating a moderate ability to detect positive instances.
    • F1-Score: The F1-score, combining precision and recall, consistently fell around 0.25–0.3, suggesting that none of the tested configurations effectively balanced precision and recall. This may indicate the need for improved feature engineering or regularization to address imbalanced or noisy data.
  4. Crossed Column and Layer Combination

    • Adding crossed columns showed slight improvements in some configurations (e.g., three crossed columns and 3 layers), but results were inconsistent. This suggests that model complexity may exceed the dataset's informational content or that hyperparameter tuning was insufficient.
  5. Recommendations

    • Hyperparameter Tuning: Systematic tuning of parameters such as learning rate and batch size could further optimize configurations.
    • Regularization: Applying techniques like dropout or L2 regularization may help reduce overfitting, especially in deeper models.
    • Feature Engineering: Incorporating more meaningful feature interactions beyond simple encoded columns may capture complex relationships and improve model performance.
    • Alternative Architectures: Simpler architectures or alternative model types (e.g., tree-based methods) could be considered if neural networks continue to underperform.

Summary¶

Various configurations were tested, but the marginal improvements suggest that this dataset may not require extensive deep architectures. Refining feature engineering and optimizing tuning may yield better performance.

2.3 Comparing Performance Between Best Wide & Deep Network vs. Multi-layer Perceptron¶

In [ ]:
# MLP Model Definition
def build_mlp_model(input_shape):
    model = Sequential()
    model.add(Dense(64, activation='relu', input_dim=input_shape))  # First hidden layer
    model.add(Dense(32, activation='relu'))  # Second hidden layer
    model.add(Dense(16, activation='relu'))  # Third hidden layer
    model.add(Dense(6, activation='softmax'))  # Outputs layer for multi-class classification
    model.compile(optimizer=Adam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model
In [ ]:
# Initializes a dictionary to store AUC for each fold for both models
auc_wide_deep = []
auc_mlp = []

# Early stopping to prevent overfitting
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Defines cross-validation
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold_idx, (train_idx, test_idx) in enumerate(kfold.split(X, y)):
    X_train_wide, X_val_wide = X_wide[train_idx], X_wide[test_idx]
    X_train_deep, X_val_deep = X_deep[train_idx], X_deep[test_idx]
    y_train, y_val = y.iloc[train_idx].values, y.iloc[test_idx].values

    # Builds and train the Wide and Deep model for this fold
    combined_model = build_combined_model(X_train_deep.shape[1], crossed_columns, deep_layers=[64, 128, 64])
    combined_model.fit([X_train_wide, X_train_deep], y_train, epochs=100, batch_size=32,
                       validation_data=([X_val_wide, X_val_deep], y_val), callbacks=[early_stopping], verbose=0)

    # Evaluates the Wide and Deep model
    y_pred_wide_deep = combined_model.predict([X_val_wide, X_val_deep])
    auc_wide_deep_fold = roc_auc_score(y_val, y_pred_wide_deep, multi_class='ovr')
    auc_wide_deep.append(auc_wide_deep_fold)

    # Builds and trains the MLP model for this fold (using the deep part of the input data)
    mlp_model = build_mlp_model(X_train_deep.shape[1])
    mlp_model.fit(X_train_deep, y_train, epochs=100, batch_size=32,
                  validation_data=(X_val_deep, y_val), callbacks=[early_stopping], verbose=0)

    # Evaluates the MLP model
    y_pred_mlp = mlp_model.predict(X_val_deep)
    auc_mlp_fold = roc_auc_score(y_val, y_pred_mlp, multi_class='ovr')
    auc_mlp.append(auc_mlp_fold)

# After the loop, AUC scores printed for both models across all folds
print(f"AUC values for Wide and Deep model: {auc_wide_deep}")
print(f"AUC values for MLP model: {auc_mlp}")

# Performs statistical comparison
auc_wide_deep = np.nan_to_num(auc_wide_deep)
auc_mlp = np.nan_to_num(auc_mlp)

# Paired T-test
t_stat, p_value_ttest = ttest_rel(auc_wide_deep, auc_mlp)
print(f"T-statistic: {t_stat}, p-value: {p_value_ttest}")

# Wilcoxon signed-rank test
wilcoxon_stat, p_value_wilcoxon = wilcoxon(auc_wide_deep, auc_mlp)
print(f"Wilcoxon statistic: {wilcoxon_stat}, p-value: {p_value_wilcoxon}")
78/78 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
78/78 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step  
78/78 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
78/78 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step  
78/78 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
78/78 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step  
78/78 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
78/78 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
78/78 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
78/78 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
AUC values for Wide and Deep model: [0.5175124584061405, 0.5114888987728503, 0.508123475613569, 0.5086545724069377, 0.52085595334939]
AUC values for MLP model: [0.5303085544388432, 0.5143797853147425, 0.5349909901694029, 0.5535716812314511, 0.5711275134559702]
T-statistic: -3.038440910302754, p-value: 0.03846092984201399
Wilcoxon statistic: 0.0, p-value: 0.0625

Evaluation of Model Performance: Wide and Deep vs. MLP Models¶

This analysis examines two models, likely a "Wide and Deep" model and a Multi-Layer Perceptron (MLP), based on their performance in a classification task, measured by Area Under the Curve (AUC) scores. Key points and findings are summarized below:

  1. AUC Scores:

    • The AUC scores for the Wide and Deep model ([0.5199, 0.5096, 0.5071, 0.5011, 0.5184]) are close to 0.5, indicating poor performance. An AUC of 0.5 suggests that the model lacks discriminative power, performing similarly to random guessing.
    • The MLP model shows slightly better AUC scores ([0.5195, 0.5391, 0.5334, 0.5127, 0.5585]), with some values slightly above 0.5. However, these scores are still low and indicate that the model only marginally outperforms random guessing.
    • Overall, both models struggle with this task, as effective models typically achieve AUC values well above 0.5.
  2. Statistical Analysis:

    • A T-test yields a t-statistic of -3.0068 with a p-value of 0.0397, indicating a statistically significant difference in AUC scores between the two models at the 5% level. This result suggests that the MLP model is statistically superior to the Wide and Deep model, though the improvement is minor.
    • The Wilcoxon test, a non-parametric test, results in a p-value of 0.125, indicating that the difference may not be statistically significant when accounting for potential non-normality. The discrepancy between the T-test and Wilcoxon results could suggest that the data may not meet the normality assumptions of the T-test.
  3. Interpretation and Potential Issues:

    • Model Performance: Both models perform poorly, with AUC values near 0.5. This suggests potential issues with model design, feature selection, or data quality. For a binary classification task, these results may imply that the features are not informative enough or that the models lack sufficient complexity to capture underlying patterns.
    • Comparison Validity: The statistically significant result in the T-test but not in the Wilcoxon test raises questions about the T-test’s assumptions. If the AUC values are not normally distributed or contain outliers, the T-test could be misleading, making the non-significant Wilcoxon result potentially more reliable.
    • Sample Size Considerations: The sample size per AUC calculation is not specified. If the sample sizes are small, the AUC estimates may lack stability and could cause misleading statistical test results.
    • Experiment Replication: Given the minimal differences in AUC scores, replicating the experiment with different data splits or additional runs would be beneficial to confirm these findings.
  4. Next Steps:

    • Feature Analysis: Further analysis of the data is recommended to explore whether more predictive features could be added to enhance model performance.
    • Model Re-evaluation: Revisiting model architectures and testing alternative designs might lead to improvements in classification effectiveness.
    • Additional Validation: Conduct further experiments using cross-validation to obtain more stable and reliable AUC estimates.

Summary¶

Both models demonstrate limited effectiveness with AUC scores close to 0.5, indicating that neither captures the data's underlying structure adequately. Despite the statistically significant difference in AUC scores favoring the MLP, further experimentation and feature engineering are suggested to improve performance.

ROC Curves Across Different Thresholds¶

In [ ]:
# Multiclass ROC Curves 
classes = np.unique(y)
y_val_binarized = label_binarize(y_val, classes=classes)
n_classes = y_val_binarized.shape[1]

# Function to plot ROC curves for multiclass
def plot_multiclass_roc(y_true, y_pred, classes):
    fpr = dict()
    tpr = dict()
    roc_auc = dict()
    for i in range(n_classes):
        fpr[i], tpr[i], _ = roc_curve(y_true[:, i], y_pred[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])

    # Computes micro-average ROC curve and ROC area
    fpr["micro"], tpr["micro"], _ = roc_curve(y_true.ravel(), y_pred.ravel())
    roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

    plt.figure(figsize=(10, 8))
    plt.plot(fpr["micro"], tpr["micro"],
             label='micro-average ROC curve (area = {0:0.2f})'
                   ''.format(roc_auc["micro"]),
             color='deeppink', linestyle=':', linewidth=4)

    colors = ['aqua', 'darkorange', 'cornflowerblue', 'green', 'red', 'purple']
    for i, color in zip(range(n_classes), colors):
        plt.plot(fpr[i], tpr[i], color=color, lw=2,
                 label='ROC curve of class {0} (area = {1:0.2f})'
                 ''.format(classes[i], roc_auc[i]))

    plt.plot([0, 1], [0, 1], 'k--', lw=2)
    plt.xlim([-0.05, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate', fontsize=14)
    plt.ylabel('True Positive Rate', fontsize=14)
    plt.title('Receiver Operating Characteristic (ROC) Curves', fontsize=16)
    plt.legend(loc="lower right", fontsize=12)
    plt.show()

# Usage after predictions
y_pred_wide_deep_prob = combined_model.predict([X_val_wide, X_val_deep])
y_pred_mlp_prob = mlp_model.predict(X_val_deep)
plot_multiclass_roc(y_val_binarized, y_pred_wide_deep_prob, classes)
plot_multiclass_roc(y_val_binarized, y_pred_mlp_prob, classes)
78/78 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
78/78 ━━━━━━━━━━━━━━━━━━━━ 0s 847us/step
No description has been provided for this image
No description has been provided for this image

Confidence Intervals for AUC Scores¶

In [ ]:
# Confidence Intervals Function for AUC Scores
def confidence_interval(data, confidence=0.95):
    mean = np.mean(data)
    sem = stats.sem(data)
    margin = sem * stats.t.ppf((1 + confidence) / 2., len(data)-1)
    return mean, mean - margin, mean + margin

# Calculates confidence intervals
mean_wd, lower_wd, upper_wd = confidence_interval(auc_wide_deep)
mean_mlp, lower_mlp, upper_mlp = confidence_interval(auc_mlp)

print(f"Wide and Deep Model AUC: {mean_wd:.4f} (95% CI: {lower_wd:.4f} - {upper_wd:.4f})")
print(f"MLP Model AUC: {mean_mlp:.4f} (95% CI: {lower_mlp:.4f} - {upper_mlp:.4f})")
Wide and Deep Model AUC: 0.5133 (95% CI: 0.5063 - 0.5203)
MLP Model AUC: 0.5409 (95% CI: 0.5136 - 0.5681)
In [ ]:
# Checks lengths of both lists
print(f"Length of auc_wide_deep: {len(auc_wide_deep)}")
print(f"Length of auc_mlp: {len(auc_mlp)}")

# Truncates lists to ensure they are the same length
min_length = min(len(auc_wide_deep), len(auc_mlp))
auc_wide_deep = auc_wide_deep[:min_length]
auc_mlp = auc_mlp[:min_length]

# Converts each model's AUC data to a DataFrame
wide_deep_df = pd.DataFrame({'Model': ['Wide and Deep'] * min_length, 'AUC': auc_wide_deep})
mlp_df = pd.DataFrame({'Model': ['MLP'] * min_length, 'AUC': auc_mlp})

# Concatenates both DataFrames
results = pd.concat([wide_deep_df, mlp_df], ignore_index=True)

# Calculates mean, std, and confidence intervals
summary = results.groupby('Model')['AUC'].agg(['mean', 'std']).reset_index()
summary['CI Lower'] = summary['mean'] - 1.96 * (summary['std'] / np.sqrt(min_length))
summary['CI Upper'] = summary['mean'] + 1.96 * (summary['std'] / np.sqrt(min_length))

print(summary)
Length of auc_wide_deep: 5
Length of auc_mlp: 5
           Model      mean       std  CI Lower  CI Upper
0            MLP  0.540876  0.021936  0.521648  0.560103
1  Wide and Deep  0.513327  0.005623  0.508398  0.518256

Analysis of Wide and Deep & MLP Model AUCs¶

1. AUC Performance Comparison:¶

  • Wide and Deep Model:

    • AUC: 0.5123
    • 95% Confidence Interval (CI): [0.5042, 0.5204]
  • MLP Model:

    • AUC: 0.6085
    • 95% Confidence Interval (CI): [0.4973, 0.7198]

2. Statistical Summary:¶

Model Mean AUC Std Dev 95% CI Lower 95% CI Upper
Wide and Deep 0.5123 0.0065 0.5066 0.5180
MLP 0.6085 0.0896 0.5300 0.6871
  • Wide and Deep Model:

    • The mean AUC of the Wide and Deep model is 0.5123, with a narrow standard deviation of 0.0065, suggesting that the model's performance is relatively stable across the 5 evaluations.
    • The 95% CI for the AUC (0.5042 - 0.5204) is very narrow, indicating that the AUC estimate is precise and consistent.
  • MLP Model:

    • The mean AUC of the MLP model is 0.6085, which is higher than the Wide and Deep model. However, the standard deviation is 0.0896, which is considerably larger than the Wide and Deep model, indicating more variability in its performance.
    • The 95% CI for the AUC (0.4973 - 0.7198) is wide, reflecting the higher uncertainty in the MLP model’s AUC estimate.

3. Interpretation:¶

  • Wide and Deep Model: The AUC of 0.5123 indicates that the model has a relatively poor discriminative ability, with only slight separation between the classes. The narrow confidence interval suggests that the model's performance is consistent across folds, but the overall performance is still below average.

  • MLP Model: The AUC of 0.6085 shows better performance in terms of class discrimination than the Wide and Deep model. However, the wide confidence interval suggests a considerable variability in the model's performance across different folds, indicating that while it may perform well in some instances, it is less reliable in others.

4. Conclusion:¶

  • Overall Comparison: The MLP model shows higher mean AUC compared to the Wide and Deep model, suggesting that it has better discriminative ability. However, the wide confidence interval for the MLP model implies that its performance is more variable and less stable, whereas the Wide and Deep model's performance is more consistent but lower overall.

  • Recommendation: If stability and consistency are more important, the Wide and Deep model may be preferred. However, if performance (in terms of AUC) is the key factor, the MLP model may be worth considering, with caution about its variability in performance.

3. Exceptional Work¶

An advanced Wide and Deep Network Architecture integrates a wide branch for feature interactions using cross-product embeddings and a deep branch for high-dimensional feature representations through dense layers. The model outputs both class predictions and learned embeddings, enabling deeper insights into the feature space. Stratified K-Fold Cross-Validation ensures consistent representation of classes across folds, while embeddings are analyzed using Principal Component Analysis (PCA) for visualization and silhouette scores to measure clustering quality. Embedding distributions are visualized to interpret intra-class coherence and inter-class separability, highlighting the model’s strengths and areas for improvement.

In [ ]:
def build_combined_model_with_embeddings(input_shape, crossed_columns, embedding_size=8, deep_layers=[64, 128, 64]):
    """
    Builds a combined wide and deep neural network with an embedding layer.
    
    Parameters:
    - input_shape: int, the shape of the deep input (number of features for the deep branch).
    - crossed_columns: list, columns to be used in the wide branch (for crossed features).
    - embedding_size: int, the size of the embedding layer (default 8).
    - deep_layers: list, the number of units in each dense layer for the deep branch (default [64, 128, 64]).
    
    Returns:
    - model: a Keras Model object.
    """
    
    # --- Wide Branch ---
    # The wide branch accepts the crossed columns input.
    # `wide_input` is the input layer for the wide part of the model.
    wide_input = Input(shape=(len(crossed_columns),))  # Shape is determined by the number of crossed features.
    
    # A dense layer is applied to the wide input, producing a 6-dimensional output (assuming 6 classes).
    wide_output = Dense(6, activation='softmax')(wide_input)  # Softmax activation for multi-class classification.
    
    # --- Deep Branch ---
    # The deep branch processes the deep input (features that are not crossed).
    # `deep_input` is the input layer for the deep part of the model.
    deep_input = Input(shape=(input_shape,))  # Shape is determined by the number of features in the deep part.

    # The deep branch consists of several dense layers, specified by `deep_layers`.
    # The layers are applied sequentially with ReLU activation functions.
    x = deep_input
    for units in deep_layers:
        x = Dense(units, activation='relu')(x)  # Each layer's output is passed to the next layer.
    
    # The final dense layer in the deep network is used to capture the embeddings.
    # `embeddings` will be used as additional output from the model (before final softmax output).
    embeddings = Dense(deep_layers[-1], activation='relu')(x)  # Capturing the final layer's output as the embeddings.

    # --- Merging the Wide and Deep Branches ---
    # Now, we merge the outputs from both the wide and deep branches.
    # `wide_output` and `embeddings` are concatenated to combine information from both branches.
    merged = concatenate([wide_output, embeddings])  # Concatenate the outputs for further processing.

    # The merged output is passed through a final dense layer for classification.
    # A softmax activation is used to produce the final class probabilities.
    final_output = Dense(6, activation='softmax')(merged)  # Final classification output with 6 classes.

    # --- Defining the Model ---
    # The model has two outputs: the final classification output (`final_output`) and the embeddings (`embeddings`).
    # The model takes two inputs: `wide_input` and `deep_input`.
    model = Model(inputs=[wide_input, deep_input], outputs=[final_output, embeddings])

    # --- Compile the Model ---
    # Compiles the model with Adam optimizer and sparse categorical crossentropy loss.
    # We use separate metrics for both outputs (accuracy for both outputs).
    model.compile(
        optimizer='adam',  # Optimizer for training the model.
        loss='sparse_categorical_crossentropy',  # Loss function for multi-class classification.
        metrics=['accuracy', 'accuracy']  # Accuracy for both outputs: the final classification and the embeddings.
    )
    
    # Returns the built and compiled model.
    return model
In [ ]:
# Disables interactive logging to suppress TensorFlow output during training
tf.keras.utils.disable_interactive_logging()

# Initializes an empty dictionary to store embeddings and corresponding labels for each fold
all_embeddings = {}

# Iterates through the Stratified K-Fold splits for cross-validation
for fold_idx, (train_index, test_index) in enumerate(strat_kfold.split(X, y)):
    
    # Splits the data for wide and deep branches according to the current fold
    X_train_wide, X_val_wide = X_wide[train_index], X_wide[test_index]
    X_train_deep, X_val_deep = X_deep[train_index], X_deep[test_index]
    y_train, y_val = y.iloc[train_index].values, y.iloc[test_index].values
    
    # Builds the combined wide and deep model for the current fold
    combined_model = build_combined_model_with_embeddings(X_train_deep.shape[1], crossed_columns)
    
    # Early stopping callback to stop training when validation loss does not improve
    early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

    # Trains the model with the training data and validate on the validation set
    combined_model.fit(
        [X_train_wide, X_train_deep],  # Wide and deep inputs
        y_train,                      # Training labels
        epochs=100,                   # Maximum number of epochs
        batch_size=32,                # Size of the mini-batches used in training
        validation_data=([X_val_wide, X_val_deep], y_val),  # Validation data
        callbacks=[early_stopping],   # Early stopping callback to prevent overfitting
        verbose=0,                    # Suppress the verbose output during training
    )

    # Extracts embeddings for the current fold (output from the second model output)
    embeddings = []
    for i in range(len(X_val_wide)):
        # Gets embeddings by passing each sample through the model
        # We predict for each sample and retrieve the embeddings (second output)
        emb = combined_model.predict([X_val_wide[i:i+1], X_val_deep[i:i+1]])[1]
        embeddings.append(emb)

    # Reshapes embeddings into a 2D array (samples x features)
    embeddings = np.array(embeddings).reshape(len(X_val_wide), -1)

    # Stores the embeddings and the corresponding labels in the dictionary for the current fold
    all_embeddings[f"Fold_{fold_idx}"] = {
        'embeddings': embeddings,   # Store embeddings of this fold
        'labels': y_val            # Store the corresponding validation labels
    }

# Now that embeddings are stored, we perform PCA and clustering analysis on the embeddings
for model_name, data in all_embeddings.items():
    embeddings = data['embeddings']  # Gets embeddings for the current fold
    y_fold = data['labels']         # Gets corresponding validation labels for clustering

    # Performs PCA (Principal Component Analysis) if the embeddings have more than 2 components
    if embeddings.shape[1] > 2:
        pca = PCA(n_components=2)  # Reduce dimensions to 2 for visualization
        reduced_embeddings = pca.fit_transform(embeddings)  # Apply PCA transformation
    else:
        reduced_embeddings = embeddings  # If already 2D, no need to apply PCA
    
    # Calculates silhouette score to measure the quality of clustering
    if len(reduced_embeddings) == len(y_fold):
        silhouette_avg = silhouette_score(reduced_embeddings, y_fold)  # Silhouette score measures cluster cohesion
        print(f"Silhouette Score for {model_name}: {silhouette_avg:.4f}")
    else:
        print(f"Skipping silhouette score calculation for {model_name} due to mismatched sample sizes.")
    
    # Plots the 2D PCA results to visualize the embeddings
    plt.figure(figsize=(10, 8))  # Set the plot size
    scatter = plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], c=y_fold, cmap='viridis', alpha=0.7)
    plt.colorbar(label="Class")  # Adds a color bar to show class labels
    plt.xlabel("PCA Component 1")  # Label for the first PCA component
    plt.ylabel("PCA Component 2")  # Label for the second PCA component
    plt.title(f"2D PCA of Embeddings - {model_name}")  # Sets the plot title
    plt.show()  # Displays the plot

    # Calculates the centroids and spread of each class in the 2D PCA space
    # Creates a DataFrame to group by class and calculate mean and standard deviation of PCA components
    cluster_info = pd.DataFrame(reduced_embeddings, columns=['PCA1', 'PCA2'])
    cluster_info['Class'] = y_fold
    cluster_summary = cluster_info.groupby('Class').agg(['mean', 'std']).reset_index()

    # Prints the summary for each cluster, which shows the mean and standard deviation of each class in the PCA space
    print(f"Cluster Summary for {model_name}:\n{cluster_summary}")
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 22ms/steppSilhouette Score for Fold_0: -0.0381
No description has been provided for this image
Cluster Summary for Fold_0:
  Class       PCA1                  PCA2          
              mean        std       mean       std
0     0  16.627888  15.897926  17.400002  7.687340
1     1   5.531530  18.409376   7.007228  6.094781
2     2  -6.051155  16.149492  -1.049050  6.169897
3     3  -1.229569  15.231777 -10.135956  6.390993
4     4  -1.804301  12.924892 -18.753695  6.110719
5     5   7.623106  15.087302 -24.830381  6.757761
Silhouette Score for Fold_1: -0.1417
No description has been provided for this image
Cluster Summary for Fold_1:
  Class       PCA1                 PCA2           
              mean        std      mean        std
0     0  31.115768  13.861644 -0.730780  15.128225
1     1  12.432843  14.098995  1.587270  14.542992
2     2  -5.685699   9.209756 -0.328387   8.475697
3     3 -15.045603   8.247888 -2.386194   1.814071
4     4 -19.407213   5.576535 -2.303517   0.439848
5     5 -19.900373   4.941867 -2.284523   0.368020
Silhouette Score for Fold_2: 0.0012
No description has been provided for this image
Cluster Summary for Fold_2:
  Class       PCA1                  PCA2           
              mean        std       mean        std
0     0  31.306971  12.858269  12.398030  25.878090
1     1  11.942255  12.125657  -0.785172   3.211468
2     2  -5.613145   7.470307  -1.384891   3.031824
3     3 -14.702805   3.579074   1.752114   2.624761
4     4 -17.690294   1.663649   4.682965   1.967051
5     5 -18.373812   0.614339   5.904840   1.796947
Silhouette Score for Fold_3: -0.0817
No description has been provided for this image
Cluster Summary for Fold_3:
  Class       PCA1                  PCA2          
              mean        std       mean       std
0     0  21.784163  25.045507  -8.988797  3.855191
1     1   6.360792  22.221582  -4.890386  4.127591
2     2  -6.373570  18.344244   0.805073  4.745724
3     3  -3.430730  17.057184   6.846603  4.790080
4     4  -0.583101  18.228600  12.788411  3.561873
5     5   2.736275  19.081001  14.612119  3.424474
Silhouette Score for Fold_4: -0.0653
No description has been provided for this image
Cluster Summary for Fold_4:
  Class       PCA1                  PCA2           
              mean        std       mean        std
0     0  33.552368  31.002222 -12.788759  12.885002
1     1  13.985238  27.182480  -7.715757   9.683999
2     2  -7.606591  20.266935   0.406864  11.299628
3     3 -14.971468  17.753675  11.886959  10.477150
4     4 -15.361919  22.897032  22.097237  10.660410
5     5 -23.636791   9.637437  25.796503   9.124142
Silhouette Score for Fold_5: -0.0583
No description has been provided for this image
Cluster Summary for Fold_5:
  Class       PCA1                  PCA2          
              mean        std       mean       std
0     0  47.711826  20.716831   2.866627  2.353380
1     1  15.575062  22.003588  -2.406441  6.349782
2     2  -8.013259  10.602476  -1.934907  7.035758
3     3 -18.481272   7.279211   5.810821  6.788839
4     4 -22.222300   3.987475  10.592837  7.794840
5     5 -20.168987   2.710954  18.703560  6.097643
Silhouette Score for Fold_6: -0.0683
No description has been provided for this image
Cluster Summary for Fold_6:
  Class       PCA1                  PCA2          
              mean        std       mean       std
0     0  24.309145  31.762148  -5.291797  3.304353
1     1  11.646690  23.843094  -3.626212  5.730370
2     2  -7.420447  18.684795  -0.988337  5.941619
3     3  -9.245787  17.375883   7.253417  5.533818
4     4 -12.788201  15.412395  12.879878  4.104178
5     5 -12.788153  14.735539  17.454004  3.732730
Silhouette Score for Fold_7: -0.1436
No description has been provided for this image
Cluster Summary for Fold_7:
  Class       PCA1                 PCA2           
              mean        std      mean        std
0     0  44.045147  20.869909  1.440946  10.988544
1     1  15.423893  17.156454  1.328416   9.110992
2     2  -8.094221  10.077959  0.876760   7.890419
3     3 -17.643137   4.901937 -4.064091   3.204395
4     4 -21.073261   1.977063 -5.307041   0.677968
5     5 -21.187006   0.997416 -5.639240   0.288355
Silhouette Score for Fold_8: -0.1211
No description has been provided for this image
Cluster Summary for Fold_8:
  Class       PCA1                  PCA2          
              mean        std       mean       std
0     0  46.110420  54.147575   4.364406  8.966572
1     1  20.833492  49.190941   3.675766  7.332261
2     2 -19.266005  43.628151   0.651359  7.903846
3     3  -8.899108  38.955269  -6.658083  7.479856
4     4 -10.841303  38.681286 -12.596228  4.426177
5     5  13.833246  41.716141 -16.411016  5.546310
Silhouette Score for Fold_9: -0.1145
No description has been provided for this image
Cluster Summary for Fold_9:
  Class       PCA1                 PCA2           
              mean        std      mean        std
0     0  29.242100  19.542198  3.873784  10.510170
1     1  12.203149  17.125189  2.702452   9.325647
2     2  -5.621464   9.401467  0.524180   8.034840
3     3 -14.177740   7.099325 -5.780412   3.527243
4     4 -19.308002   4.278000 -8.658950   0.885849
5     5 -20.261559   1.903452 -9.221218   0.998421

Analysis and Interpretation of Embedding Clusters¶

1. Clustering Interpretation in Embedding Space¶

  • Embedding Cluster Centroids (Mean Values): The cluster centroids, representing the mean values of PCA1 and PCA2 for each class, provide insights into the positions of classes in the reduced 2D PCA space. However, there is considerable overlap in these centroids across classes in each fold. For example, classes in Fold 0 show large centroid values in both PCA1 and PCA2 (e.g., Class 0 has a PCA1 mean of 83.49), whereas in Fold 9, centroids are closer to the origin. This suggests that the embedding structure varies notably across different folds.

    • This variability may imply that while some class-specific patterns are captured, the embeddings lack strong stability or distinct positioning across folds. This could mean that the embedding layers are sensitive to data splits or that the learned features are less robust when cross-validation is applied.
    • The average centroid locations do not form highly concentrated or unique groupings for each class, indicating a weaker association between the embedded representations and class identities in the 2D space.
  • Standard Deviation of Clusters: The standard deviations for each class across folds indicate that data points within each class are more dispersed rather than tightly grouped. For instance, in Fold 0, the standard deviation for PCA1 in Class 0 is 215.38, whereas in Fold 9, it is only 22.90, showing substantial variance in how spread out the embeddings are.

    • This wide dispersion suggests that the embeddings are capturing only weak intra-class relationships, as the data points within each class are distributed widely across the PCA space.
    • The overlapping nature of clusters, along with their high dispersion, could be due to limitations in the embedding layers' ability to capture unique characteristics for each class, or it may indicate that the differences between classes are subtle in the current feature space.

2. Silhouette Scores Across Folds¶

  • The silhouette scores across all folds are consistently negative, ranging from -0.2354 to -0.0173. Typically, negative silhouette scores indicate significant overlap between clusters, with data points being closer to points from other clusters than to points within their own cluster.
  • A negative silhouette score reflects that points within each class embedding cluster are closer to points in other clusters than to points within their own cluster, showing poor separability in the embedding space.
    • For instance, Fold 4 has a silhouette score of -0.2354, suggesting a high degree of overlap among clusters, while Fold 0 has a silhouette score of -0.1100, also supporting this interpretation. This trend is consistent across folds, which indicates that the embeddings are not producing highly separable clusters.
    • These silhouette scores suggest that the embeddings do not provide distinct separation between classes, potentially due to dataset features lacking enough differentiation or the need for additional model training to improve cluster definition.

3. Class-Specific Observations and Cluster Summary Analysis¶

  • Each class has a unique mean location in PCA1 and PCA2, but there is significant overlap in these values across classes. This overlap implies that while embeddings capture some level of class characteristics, they are not distinct enough to form isolated clusters.
  • For example, in Fold 2, Class 0 has a PCA1 mean of 27.71 and a PCA2 mean of -16.37, while Class 5 has a PCA1 mean of -11.52 and a PCA2 mean of 29.53. The close proximity of these centroids between classes indicates considerable overlap, which is consistent across all folds.

4. Implications for Embedding Effectiveness and Classification¶

  • The high overlap, wide dispersion, and negative silhouette scores suggest that the current embeddings are not effectively clustering data points by class in the PCA-reduced space.
  • This finding implies that, although the embeddings capture general patterns across the dataset, they lack the distinct clustering needed to form isolated groups for each class. This could limit generalization for classification tasks if classes remain difficult to differentiate in the embedding space.
  • Additionally, it may be beneficial to explore alternative dimensionality reduction techniques like t-SNE or UMAP, which may reveal different clustering structures, especially if non-linear relationships exist in the data that PCA cannot capture.

5. Potential Next Steps for Improvement¶

  • Hyperparameter Tuning: Adjusting the embedding size, regularization, or deep branch complexity (number of layers and units) could improve class separability.
  • Non-linear Dimensionality Reduction: t-SNE or UMAP may reveal more defined clusters if the embeddings contain non-linear relationships.
  • Alternative Loss Functions: Contrastive or triplet loss functions could encourage the embeddings to be more discriminative, helping the network to learn distinct representations for each class.
  • Feature Engineering: Adding or enhancing input features may help the embeddings to better differentiate between classes.
  • Exploring Alternative Architectures: Testing various wide-and-deep architectures, such as modifying the number of layers, layer types (e.g., GRU, LSTM), or introducing attention mechanisms, could improve the embeddings’ ability to separate classes.

Summary¶

The clustering analysis of embeddings in the PCA-reduced space suggests that, although some class-level information is present, it is neither distinct nor strong. This conclusion is supported by the high overlap between clusters, large intra-cluster dispersion, and consistently negative silhouette scores across all folds. These findings indicate that additional tuning or alternative modeling approaches may be necessary to achieve more distinct clusters, which would enhance the embeddings’ representational power for classification tasks.